Cardiovascular Risk Factors: Examining Cardiovascular Risk Using NHANES Pre-Pandemic Data

Author

Amr Salem

Modified

2024-12-12

1 Background

Cardiovascular disease (CVD) remains one of the leading causes of death globally, with cholesterol levels playing a critical role in assessing an individual’s risk. Cholesterol, particularly low-density lipoprotein (LDL) and high-density lipoprotein (HDL), is closely associated with the development of cardiovascular diseases. Understanding the relationship between various factors such as body mass index (BMI), smoking habits, hypertension, and age can provide valuable insights into how these risk factors contribute to elevated cholesterol levels and overall cardiovascular health.

The National Health and Nutrition Examination Survey (NHANES) provides an extensive dataset that can be used to explore the correlation between lifestyle factors, health indicators, and cholesterol levels. In this study, we examine NHANES pre-pandemic data, focusing on various health factors, including BMI, smoking, age, hypertension, and cholesterol levels. Our goal is to explore how these factors can be utilized to predict cholesterol risk, ultimately informing public health strategies and interventions.

By analyzing these relationships, we aim to better understand how different variables impact cholesterol levels and to identify the most significant predictors of cardiovascular risk. The results from this study can help in the development of targeted interventions and health guidelines for individuals at risk of developing heart disease.

This report reflects the analysis behind the dashboard, which presents the findings and insights from this study. Throughout the report, each analysis is accompanied by a subsection that discusses the choices made during the analysis process, including variable selection, modeling decisions, and the rationale behind visualizations displayed on the dashboard. These discussions aim to provide transparency into the methodology and ensure that the interpretation of the results is grounded in the data and analytical approach.

2 Setup and Ingest

knitr::opts_chunk$set(comment = NA)
library(haven)
library(caret)
library(broom)
library(dplyr)
library(mice)
library(ggplot2)
library(DescTools)
library(jtools)
library(plotly)
library(gridExtra)
library(readr)
library(DT)
library(tibble)
library(boot)
library(tidyr)
library(knitr)

Importing the data:

P_TRIGLY <- read_xpt("P_TRIGLY.xpt")    # Triglycerides dataset
P_HDL <- read_xpt("P_HDL.xpt")          # HDL cholesterol dataset
BMI_data <- read_xpt("P_BMX.xpt")       # Body measures (BMI)
demog <- read_xpt("P_DEMO.xpt")         # Demographics
chol_data <- read_xpt("P_TCHOL.xpt")    # Total cholesterol
BP_data <- read_xpt("P_BPQ.xpt")        # Blood pressure questionnaire
SMQ_data <- read_xpt("P_SMQ.xpt")       # Smoking questionnaire
exercise_data <- read_xpt("P_PAQ.xpt")  # Physical activity questionnaire

The following code does some data extraction and cleaning.

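# HDL vs LDL Analysis Tibble
# Note: the column named LDL is filled from LBXTR, which Section 4.4 identifies
# as triglycerides; the name is kept because later code refers to it.
HDL_vs_LDL <- full_join(P_HDL, P_TRIGLY, by = "SEQN") %>%
  select(SEQN, HDL = LBDHDD, LDL = LBXTR) %>%
  as_tibble()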

# BMI and Gender Analysis Tibble
BMI_and_Gender <- full_join(BMI_data, demog, by = "SEQN") %>%
  select(SEQN, BMI = BMXBMI, Gender = RIAGENDR) %>%
  mutate(Gender = recode(Gender, `1` = "Male", `2` = "Female")) %>%
  mutate(Gender = factor(Gender, levels = c("Male", "Female"))) %>%
  as_tibble()

# Cholesterol & Age Groups Analysis Tibble
Cholesterol_AgeGroups <- full_join(chol_data, demog, by = "SEQN") %>%
  select(SEQN, Total_Cholesterol = LBXTC, Age = RIDAGEYR) %>%
  mutate(Age_Group = case_when(
    Age < 20 ~ "<20",
    Age >= 20 & Age < 30 ~ "20-29",
    Age >= 30 & Age < 40 ~ "30-39",
    Age >= 40 & Age < 50 ~ "40-49",
    Age >= 50 & Age < 60 ~ "50-59",
    Age >= 60 & Age < 70 ~ "60-69",
    Age >= 70 & Age < 80 ~ "70-79",
    Age >= 80 ~ "80+"
  )) %>%
  mutate(Age_Group = factor(Age_Group, 
                            levels = c("<20", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"))) %>%
  as_tibble()
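
The same left-closed age bins can also be expressed with base R's `cut()`. The sketch below is an alternative formulation, not the code used in this report; the `ages` vector is made up to sit on the bin edges so the boundary behaviour is easy to check.

```r
# Hypothetical ages chosen to sit on the bin edges (not NHANES records)
ages <- c(0, 19, 20, 29, 30, 79, 80)

age_breaks <- c(-Inf, 20, 30, 40, 50, 60, 70, 80, Inf)
age_labels <- c("<20", "20-29", "30-39", "40-49",
                "50-59", "60-69", "70-79", "80+")

# right = FALSE makes each bin closed on the left: [20, 30), [30, 40), ...
age_group <- cut(ages, breaks = age_breaks, labels = age_labels, right = FALSE)

data.frame(Age = ages, Age_Group = age_group)
```

Because `cut()` returns a factor whose levels follow the order of `labels`, no separate `factor()` call is needed in this variant.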

# Smoking & Hypertension Analysis Tibble
Smoking_Hypertension <- full_join(SMQ_data, BP_data, by = "SEQN") %>%
  select(SEQN, Smoking_Status = SMQ020, Hypertension_Status = BPQ020) %>%
  mutate(
    Smoking_Status = recode(
      Smoking_Status,
      `1` = "Smoker",
      `2` = "Non-Smoker",
      .default = "Unknown"
    ),
    Hypertension_Status = recode(
      Hypertension_Status,
      `1` = "Yes",
      `2` = "No",
      .default = "Unknown"
    )
  ) %>%
  mutate(
    Smoking_Status = factor(Smoking_Status, levels = c("Smoker", "Non-Smoker")),
    Hypertension_Status = factor(Hypertension_Status, levels = c("Yes", "No"))
  ) %>%
  as_tibble()

# Exercise Data Clean-Up
exercise_data_clean <- exercise_data %>%
  filter(PAQ655 != 77, PAQ655 != 99, !is.na(PAQ655)) %>%
  mutate(PAQ655 = as.integer(PAQ655)) %>%
  rename(vigorous_activity_days = PAQ655)

The following code summarizes the data.

# Check missing data for each tibble
summary(HDL_vs_LDL)  # Check missing in HDL_vs_LDL
      SEQN             HDL              LDL        
 Min.   :109264   Min.   :  5.00   Min.   :  10.0  
 1st Qu.:113178   1st Qu.: 43.00   1st Qu.:  56.0  
 Median :117098   Median : 51.00   Median :  84.0  
 Mean   :117083   Mean   : 53.47   Mean   : 103.7  
 3rd Qu.:120995   3rd Qu.: 61.00   3rd Qu.: 126.0  
 Max.   :124822   Max.   :189.00   Max.   :2684.0  
                  NA's   :1370     NA's   :7548    
summary(BMI_and_Gender)  # Check missing in BMI_and_Gender
      SEQN             BMI           Gender    
 Min.   :109263   Min.   :11.90   Male  :7721  
 1st Qu.:113153   1st Qu.:20.40   Female:7839  
 Median :117042   Median :25.80                
 Mean   :117042   Mean   :26.66                
 3rd Qu.:120932   3rd Qu.:31.40                
 Max.   :124822   Max.   :92.30                
                  NA's   :2423                 
summary(Cholesterol_AgeGroups)  # Check missing in Cholesterol_AgeGroups
      SEQN        Total_Cholesterol      Age          Age_Group   
 Min.   :109263   Min.   : 71.0     Min.   : 0.00   <20    :6328  
 1st Qu.:113153   1st Qu.:149.0     1st Qu.:10.00   60-69  :1746  
 Median :117042   Median :173.0     Median :30.00   50-59  :1565  
 Mean   :117042   Mean   :177.5     Mean   :33.74   40-49  :1446  
 3rd Qu.:120932   3rd Qu.:201.0     3rd Qu.:56.00   30-39  :1421  
 Max.   :124822   Max.   :446.0     Max.   :80.00   20-29  :1378  
                  NA's   :4732                      (Other):1676  
summary(Smoking_Hypertension)  # Check missing in Smoking_Hypertension
      SEQN           Smoking_Status Hypertension_Status
 Min.   :109264   Smoker    :3889   Yes :3597          
 1st Qu.:113182   Non-Smoker:5799   No  :6586          
 Median :117058   NA's      :1471   NA's: 976          
 Mean   :117066                                        
 3rd Qu.:120958                                        
 Max.   :124822                                        
summary(exercise_data_clean)  # Check missing in exercise_data_clean
      SEQN            PAQ605          PAQ610           PAD615     
 Min.   :109266   Min.   :1.000   Min.   : 1.000   Min.   : 10.0  
 1st Qu.:113078   1st Qu.:1.000   1st Qu.: 3.000   1st Qu.: 60.0  
 Median :117046   Median :2.000   Median : 5.000   Median :180.0  
 Mean   :117075   Mean   :1.678   Mean   : 4.545   Mean   :210.9  
 3rd Qu.:121078   3rd Qu.:2.000   3rd Qu.: 5.000   3rd Qu.:300.0  
 Max.   :124822   Max.   :2.000   Max.   :99.000   Max.   :840.0  
                                  NA's   :1641     NA's   :1647   
     PAQ620          PAQ625           PAD630         PAQ635     
 Min.   :1.000   Min.   : 1.000   Min.   :  10   Min.   :1.000  
 1st Qu.:1.000   1st Qu.: 3.000   1st Qu.:  60   1st Qu.:1.000  
 Median :1.000   Median : 5.000   Median : 120   Median :2.000  
 Mean   :1.501   Mean   : 4.594   Mean   : 213   Mean   :1.708  
 3rd Qu.:2.000   3rd Qu.: 5.000   3rd Qu.: 240   3rd Qu.:2.000  
 Max.   :9.000   Max.   :99.000   Max.   :9999   Max.   :2.000  
                 NA's   :1206     NA's   :1214                  
     PAQ640          PAD645           PAQ650  vigorous_activity_days
 Min.   :1.000   Min.   : 10.00   Min.   :1   Min.   :1.0           
 1st Qu.:3.000   1st Qu.: 20.00   1st Qu.:1   1st Qu.:2.0           
 Median :5.000   Median : 30.00   Median :1   Median :3.0           
 Mean   :4.657   Mean   : 58.55   Mean   :1   Mean   :3.4           
 3rd Qu.:7.000   3rd Qu.: 60.00   3rd Qu.:1   3rd Qu.:5.0           
 Max.   :7.000   Max.   :480.00   Max.   :1   Max.   :7.0           
 NA's   :1715    NA's   :1717                                       
     PAD660            PAQ665          PAQ670          PAD675       
 Min.   :  10.00   Min.   :1.000   Min.   :1.000   Min.   :  10.00  
 1st Qu.:  40.00   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:  30.00  
 Median :  60.00   Median :1.000   Median :3.000   Median :  60.00  
 Mean   :  84.64   Mean   :1.296   Mean   :3.498   Mean   :  71.37  
 3rd Qu.: 120.00   3rd Qu.:2.000   3rd Qu.:5.000   3rd Qu.:  60.00  
 Max.   :9999.00   Max.   :2.000   Max.   :7.000   Max.   :9999.00  
 NA's   :4                         NA's   :716     NA's   :719      
     PAD680      
 Min.   :   2.0  
 1st Qu.: 180.0  
 Median : 300.0  
 Mean   : 362.2  
 3rd Qu.: 480.0  
 Max.   :9999.0  
 NA's   :4       

In this analysis, I decided to drop the missing values from the dataset (a complete-case analysis). In NHANES, missing data often arise because individuals did not attend examination appointments or for other non-systematic reasons, so I judged that dropping these rows would not materially bias the results. The summaries above show how many values are missing in each tibble; the share is not uniform (for example, 7548 NA's for LDL against 1370 for HDL), so the effective sample size differs by analysis. Dropping incomplete rows keeps the approach simple and interpretable, and the remaining samples are still large enough for the statistical tests and modeling I plan to perform.
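
To make the cost of dropping rows explicit, the share of missing values per column and the number of complete rows can be computed before filtering. The following is a minimal sketch on a made-up data frame (`toy` is hypothetical and only mimics the shape of one analysis tibble).

```r
# Toy data frame standing in for one analysis tibble (values are made up)
toy <- data.frame(
  SEQN = 1:6,
  HDL  = c(50, NA, 61, 43, 55, NA),
  LDL  = c(NA, NA, 84, 126, NA, 56)
)

# Share of missing values per column, in percent
pct_missing <- round(colMeans(is.na(toy)) * 100, 1)
pct_missing

# Rows that would survive a complete-case filter
n_complete <- sum(complete.cases(toy))
n_complete
```

Running the same two lines on each real tibble would show how much of the sample each "Decision" step below removes.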

# Drop rows with missing values for HDL, LDL, BMI, and other relevant variables
# Decision 1: Drop rows with missing values for HDL and LDL
HDL_vs_LDL <- HDL_vs_LDL %>%
  filter(!is.na(HDL), !is.na(LDL)) %>%
  as_tibble()

# Decision 2: Drop rows with missing values for BMI and Gender
BMI_and_Gender <- BMI_and_Gender %>%
  filter(!is.na(BMI), !is.na(Gender)) %>%
  as_tibble()

# Decision 3: Drop rows with missing values for Total_Cholesterol and Age_Group
Cholesterol_AgeGroups <- Cholesterol_AgeGroups %>%
  filter(!is.na(Total_Cholesterol), !is.na(Age_Group)) %>%
  as_tibble()

# Decision 4: Drop rows with unknown Smoking_Status or Hypertension_Status
# ("Unknown" was not among the factor levels set earlier, so those rows are NA here)
Smoking_Hypertension <- Smoking_Hypertension %>%
  filter(!is.na(Smoking_Status), !is.na(Hypertension_Status)) %>%
  as_tibble()

# Decision 5: Clean exercise data by removing "77", "99", and NAs
exercise_data_clean <- exercise_data_clean %>%
  filter(!is.na(vigorous_activity_days)) %>%
  as_tibble()

3 Codebook and Data Description

The following code counts the unique participants in each tibble used in the analyses.

# Count the number of unique participants (SEQN) for each dataset
hdld_count <- HDL_vs_LDL %>%
  summarise(HDL_LDL_participants = n_distinct(SEQN))

bmi_and_gender_count <- BMI_and_Gender %>%
  summarise(BMI_Gender_participants = n_distinct(SEQN))

cholesterol_age_count <- Cholesterol_AgeGroups %>%
  summarise(Cholesterol_AgeGroups_participants = n_distinct(SEQN))

exercise_count <- exercise_data_clean %>%
  summarise(Exercise_participants = n_distinct(SEQN))

smoking_hypertension_count <- Smoking_Hypertension %>%
  summarise(Smoking_Hypertension_participants = n_distinct(SEQN))


# Merge the counts from all datasets
study_participation <- bind_rows(
  tibble(Study = "HDL vs LDL", Participants = hdld_count$HDL_LDL_participants),
  tibble(Study = "BMI and Gender", Participants = bmi_and_gender_count$BMI_Gender_participants),
  tibble(Study = "Cholesterol and Age Groups", Participants = cholesterol_age_count$Cholesterol_AgeGroups_participants),
  tibble(Study = "Exercise Data", Participants = exercise_count$Exercise_participants),
  tibble(Study = "Smoking and Hypertension", Participants = smoking_hypertension_count$Smoking_Hypertension_participants)
)
study_participation
# A tibble: 5 × 2
  Study                      Participants
  <chr>                             <int>
1 HDL vs LDL                         4650
2 BMI and Gender                    13137
3 Cholesterol and Age Groups        10828
4 Exercise Data                      2421
5 Smoking and Hypertension           9676

3.1 Codebook

# Create the codebook
codebook <- tibble::tibble(
  Variable_Name = c(
    "SEQN", 
    "LDL", "HDL", 
    "BMI", "Gender", 
    "Total_Cholesterol", "Age_Group", 
    "Smoking_Status", "Hypertension_Status","vigorous_activity_days"
  ),
  Variable_Type = c(
    "Identifier", 
    "Quant", "Quant", 
    "Quant", "Binary", 
    "Quant", "8-cat", 
    "Binary", "Binary", "Quant"
  ),
  Original_Name = c(
    "SEQN", 
    "LBXTR", "LBDHDD", 
    "BMXBMI", "RIAGENDR", 
    "LBXTC", "RIDAGEYR", 
    "SMQ020", "BPQ020","PAQ655"
  )
)

# Print the codebook
suppressWarnings({
  knitr::kable(
    codebook,
    col.names = c("Variable Name", "Variable Type", "Original Name"),
    caption = "Codebook for Variables Used in Analyses"
  )
})
Codebook for Variables Used in Analyses

Variable Name           Variable Type  Original Name
SEQN                    Identifier     SEQN
LDL                     Quant          LBXTR
HDL                     Quant          LBDHDD
BMI                     Quant          BMXBMI
Gender                  Binary         RIAGENDR
Total_Cholesterol       Quant          LBXTC
Age_Group               8-cat          RIDAGEYR
Smoking_Status          Binary         SMQ020
Hypertension_Status     Binary         BPQ020
vigorous_activity_days  Quant          PAQ655

4 HDL vs LDL analysis

4.1 The Question

The relationship between HDL (high-density lipoprotein) and LDL (low-density lipoprotein) cholesterol levels is of particular interest in understanding cardiovascular health. HDL is often referred to as “good cholesterol” due to its role in transporting cholesterol away from the arteries, while LDL is called “bad cholesterol” because high levels can lead to plaque buildup in arteries.

Pre-existing Belief: Before analyzing the data, it is expected that there is an inverse relationship between HDL and LDL levels. That is, individuals with higher HDL levels tend to have lower LDL levels, as HDL is thought to counteract the effects of LDL.

Research Question: Is there a significant difference between high-density lipoprotein (HDL) cholesterol and low-density lipoprotein (LDL) cholesterol levels?

4.2 Data Description

# Create the codebook table
data_description <- data.frame(
  Variable_Name = c("HDL_Level", "LDL_Level", "Age_Group", "Gender"),
  Description = c("High-density lipoprotein cholesterol (mg/dL)",
                  "Low-density lipoprotein cholesterol (mg/dL)",
                  "Age categorized into groups",
                  "Gender of the participant"),
  Type = c("Quantitative", "Quantitative", "8-Cat", "Binary"),
  # LDL_Level is populated from LBXTR in HDL_vs_LDL (see the codebook in Section 3.1)
  Original_Variable_Name = c("LBDHDD", "LBXTR", "RIDAGEYR", "RIAGENDR")
)

# View the data description table
data_description
  Variable_Name                                  Description         Type
1     HDL_Level High-density lipoprotein cholesterol (mg/dL) Quantitative
2     LDL_Level  Low-density lipoprotein cholesterol (mg/dL) Quantitative
3     Age_Group                  Age categorized into groups        8-Cat
4        Gender                    Gender of the participant       Binary
  Original_Variable_Name
1                 LBDHDD
2                  LBXTR
3               RIDAGEYR
4               RIAGENDR

The following is a summary of the data:

# Numeric summaries for HDL and LDL
summary_stats <- HDL_vs_LDL %>%
  summarise(
    HDL_Min = min(HDL, na.rm = TRUE),
    HDL_Max = max(HDL, na.rm = TRUE),
    HDL_Mean = mean(HDL, na.rm = TRUE),
    HDL_Median = median(HDL, na.rm = TRUE),
    HDL_SD = sd(HDL, na.rm = TRUE),
    LDL_Min = min(LDL, na.rm = TRUE),
    LDL_Max = max(LDL, na.rm = TRUE),
    LDL_Mean = mean(LDL, na.rm = TRUE),
    LDL_Median = median(LDL, na.rm = TRUE),
    LDL_SD = sd(LDL, na.rm = TRUE)
  )

# Reshape to one row per statistic, with HDL and LDL as columns
pivoted_stats <- summary_stats %>%
  pivot_longer(cols = everything(), 
               names_to = c("Type", "Statistic"), 
               names_sep = "_") %>%
  pivot_wider(names_from = "Type", values_from = "value")

# View the pivoted statistics
pivoted_stats
# A tibble: 5 × 3
  Statistic   HDL    LDL
  <chr>     <dbl>  <dbl>
1 Min        11     10  
2 Max       187   2684  
3 Mean       53.6  104. 
4 Median     51     84  
5 SD         15.5   89.8

The following is the density plot of HDL vs LDL to examine the data.

# Density plot of the untransformed values, with legend
ggplot(HDL_vs_LDL) +
  # Density of triglycerides (LBXTR, stored in the LDL column)
  geom_density(aes(x = LDL, fill = "Triglycerides"), color = "black", alpha = 0.6) +
  # Density of HDL cholesterol (LBDHDD)
  geom_density(aes(x = HDL, fill = "HDL Cholesterol"), color = "black", alpha = 0.6) +
  labs(
    title = "Density Plot of Triglycerides and HDL Cholesterol",
    x = "Concentration (mg/dL)",
    y = "Density"
  ) +
  scale_fill_manual(name = "Variable", values = c("Triglycerides" = "skyblue", "HDL Cholesterol" = "lightgreen")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 8),
    legend.title = element_text(size = 7),
    legend.text = element_text(size = 7)
  )

My decision to apply a log transformation to both variables was driven primarily by the skewness observed in the original distributions. Lipid levels in population studies often follow a right-skewed distribution: most values cluster at lower levels, with a long tail extending towards higher values.

The following is the density plot after transformation:

ggplot(HDL_vs_LDL) +
  geom_density(aes(x = log(LDL), fill = "Triglycerides"), color = "black", alpha = 0.6) +
  geom_density(aes(x = log(HDL), fill = "HDL Cholesterol"), color = "black", alpha = 0.6) +
  labs(
    title = "Density Plot of Log-Transformed Triglycerides and HDL Cholesterol",
    x = "Log(Concentration)",
    y = "Density"
  ) +
  scale_fill_manual(name = "Variable", values = c("Triglycerides" = "skyblue", "HDL Cholesterol" = "lightgreen")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 8),
    legend.title = element_text(size = 7),
    legend.text = element_text(size = 7)
  )

write.csv(HDL_vs_LDL,"hdlvsldl.csv", row.names = FALSE)

4.3 Main Analysis

Hypotheses:

Null Hypothesis (H₀): There is no significant difference between HDL and LDL cholesterol levels (i.e., the mean difference between HDL and LDL levels is zero).

Alternative Hypothesis (H₁): There is a significant difference between HDL and LDL cholesterol levels (i.e., the mean difference between HDL and LDL levels is not zero).

Before conducting this analysis, I have a pre-existing belief based on general medical knowledge that HDL and LDL levels are likely to be different in most populations due to their differing roles in cardiovascular health—HDL is often referred to as “good” cholesterol, while LDL is termed “bad” cholesterol.

Choice of Statistical Test:

For this analysis, I chose to perform a paired t-test, which is appropriate when comparing the means of two related groups (in this case, HDL and LDL cholesterol levels for the same subjects). The paired t-test is used to assess whether the difference in cholesterol levels between the two measures is statistically significant.

The decision to use a paired t-test was made because:

Paired Data: The same individuals provide measurements for both HDL and LDL cholesterol, making the data “paired.” This allows us to examine whether there is a difference in cholesterol levels between the two types within the same subjects.

Assumptions for the Paired t-Test:

Paired Data: The data consists of paired observations, where each participant’s HDL and LDL levels are measured.

Scale of Measurement: Both HDL and LDL cholesterol levels are measured on an interval scale (mg/dL), making the use of a t-test appropriate.

# Perform paired t-test
ttest_result <- t.test(HDL_vs_LDL$LDL, HDL_vs_LDL$HDL, paired = TRUE)

# Print the result of the t-test
ttest_result

    Paired t-test

data:  HDL_vs_LDL$LDL and HDL_vs_LDL$HDL
t = 35.582, df = 4649, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 47.36787 52.89192
sample estimates:
mean difference 
       50.12989 
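
The hypotheses above are stated in terms of the mean difference because a paired t-test is the same calculation as a one-sample t-test on the within-subject differences. The sketch below illustrates that equivalence on made-up numbers (`a` and `b` are hypothetical, not NHANES values).

```r
# Made-up paired measurements for five subjects (not NHANES values)
a <- c(120, 95, 150, 88, 110)
b <- c(50, 61, 43, 55, 48)

paired_fit     <- t.test(a, b, paired = TRUE)
difference_fit <- t.test(a - b, mu = 0)

# Both calls test the same quantity: the mean of the within-subject differences
all.equal(unname(paired_fit$statistic), unname(difference_fit$statistic))
all.equal(paired_fit$conf.int[1:2], difference_fit$conf.int[1:2])
```

This also shows that the relevant distribution to inspect is that of the differences, rather than each variable separately.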

4.4 Conclusion

Answer to the Research Question: As tested, the research question was: “Is there a significant difference between Triglyceride (LBXTR) and HDL Cholesterol (LBDHDD) levels in the study population?” The column labeled LDL in the analysis tibble is populated from LBXTR, so the comparison is between triglycerides and HDL.

Based on the results of the paired t-test, we can conclude that there is a statistically significant difference between the triglyceride and HDL cholesterol levels. The test statistic is 35.582 (df = 4649), and the p-value (< 2.2e-16) is well below the significance threshold of 0.05. We therefore reject the null hypothesis that there is no difference between triglyceride and HDL cholesterol levels.

95% Confidence Interval:

The 95% confidence interval for the mean difference between triglyceride and HDL cholesterol levels runs from 47.37 to 52.89 mg/dL. Because the whole interval lies above zero, triglyceride levels are, on average, significantly higher than HDL cholesterol levels in the study population.
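
As a check on how this interval arises, it can be rebuilt from the numbers printed in the t-test output in Section 4.3. The sketch below assumes only those three reported values; since the t statistic is rounded in the printout, agreement is expected to about two decimal places.

```r
# Values copied from the paired t-test output in Section 4.3
mean_diff <- 50.12989
t_stat    <- 35.582
df        <- 4649

# Standard error recovered from t = mean_diff / SE
se <- mean_diff / t_stat

# 95% interval: estimate plus or minus the critical t value times the SE
t_crit <- qt(0.975, df)
ci <- mean_diff + c(-1, 1) * t_crit * se
round(ci, 2)
```

With 4649 degrees of freedom the critical value is very close to the normal value of 1.96, so the interval is essentially the mean difference plus or minus about two standard errors.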

Pre-existing Belief Reflection:

My belief, based on prior knowledge of lipid profiles, was that triglyceride levels would differ from HDL cholesterol levels, and the data support this. The mean difference between triglyceride and HDL cholesterol levels is 50.13 mg/dL, which aligns with the belief that triglycerides are generally higher than HDL in the studied population.

Logical Next Steps: The next step could involve exploring the association between triglycerides and HDL cholesterol in more depth. For example, a regression model could be used to examine how predictors such as age, gender, smoking status, and BMI relate to triglyceride and HDL cholesterol levels. Additionally, this analysis could be extended to investigate whether individuals with higher triglyceride levels also show an increased risk for cardiovascular diseases.

Lastly, future studies might want to explore other potential interactions between lipid profiles and other biomarkers, such as blood pressure or inflammatory markers, to provide a more comprehensive understanding of the cardiovascular health risks in the population.

4.5 Reasoning for Dashboard

The goal of the dashboard is to provide clear and easily interpretable visualizations and key performance indicators (KPIs) that can help users quickly understand the relationship between triglyceride and HDL cholesterol levels, as well as assess the quality of lipid profiles in the study population.

Chosen Visualizations and KPIs

Mean LDL-to-HDL Ratio:

Why Chosen: The LDL-to-HDL ratio is a commonly used metric to assess lipid balance, which plays a crucial role in cardiovascular health. A higher ratio suggests a higher risk of heart disease, making it a valuable indicator for monitoring lipid profiles in the population.

Rationale for KPI: The mean ratio provides a quick, overall assessment of the population’s lipid balance. By comparing this value across different subgroups or over time, we can track trends and identify potential areas of concern.

Implementation:

list(
  icon = "arrow-down-up",
  color = "#ffe4b2",
  value = round(mean(HDL_vs_LDL$LDL / HDL_vs_LDL$HDL, na.rm = TRUE), 2)
)
$icon
[1] "arrow-down-up"

$color
[1] "#ffe4b2"

$value
[1] 2.29
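
Note that this KPI averages each participant's own ratio, which is generally not the same number as dividing one column mean by the other. The sketch below uses made-up values (`num` and `den` are hypothetical) to show the two quantities side by side.

```r
# Made-up numerator and denominator values for three subjects
num <- c(100, 200, 60)
den <- c(50, 40, 60)

# Mean of per-subject ratios: (2 + 5 + 1) / 3
mean_of_ratios <- mean(num / den)

# Ratio of the two column means: 120 / 50
ratio_of_means <- mean(num) / mean(den)

c(mean_of_ratios = mean_of_ratios, ratio_of_means = ratio_of_means)
```

The mean of ratios is more sensitive to individuals with a small denominator, which is worth keeping in mind when reading the dashboard value.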

Healthy Lipid Profiles (%):

Why Chosen: Healthy lipid profiles are defined here as triglyceride levels under 100 mg/dL together with HDL cholesterol levels of at least 60 mg/dL, matching the thresholds in the code below. This KPI reflects the proportion of individuals in the dataset who meet both criteria, providing insight into the overall cardiovascular health of the population.

Rationale for KPI: Understanding the percentage of individuals with a healthy lipid profile allows stakeholders to evaluate the general health of the group, identify trends, and determine the need for public health interventions.

Implementation:

list(
  icon = "heart",
  color = "green",
  value = round(
    mean(
      HDL_vs_LDL$LDL < 100 & HDL_vs_LDL$HDL >= 60, na.rm = TRUE
    ) * 100, 
    1
  )
)
$icon
[1] "heart"

$color
[1] "green"

$value
[1] 23.9

Histograms of Triglycerides (LBXTR) and HDL Cholesterol (LBDHDD):

Why Chosen: Histograms provide an overview of the distribution of triglyceride and HDL cholesterol values in the population. These visualizations help identify patterns such as skewness, potential outliers, and the overall spread of the data.

Rationale for Visualization: By examining the distribution of these lipid measures, we can better understand the population’s lipid profiles and identify areas where the data might need cleaning or further investigation (e.g., removing outliers).

Density Plot Comparison of Triglycerides and HDL Cholesterol:

Why Chosen: A density plot allows for a smooth estimation of the distribution of triglyceride and HDL cholesterol levels, making it easier to compare their shapes and spread. The log transformation is applied to both variables to reduce skewness and facilitate a more accurate comparison.

Rationale for Visualization: This plot helps visualize the overlap (or lack thereof) between the two distributions, providing a deeper understanding of how triglycerides and HDL cholesterol differ in terms of their concentration distributions across the population.

5 BMI and Gender

5.1 The Question

Research Question

Is there a difference in BMI distribution between males and females?

Description of Study:

In this analysis, I want to explore the relationship between Body Mass Index (BMI) and gender. BMI is a key measure commonly used to assess whether individuals are underweight, normal weight, overweight, or obese, based on their height and weight. Given that BMI can influence health outcomes like cardiovascular risk, diabetes, and overall morbidity, understanding any potential gender differences in BMI distributions is important for targeted health interventions.

Pre-existing Belief:

Before examining the data, I believe that there will be a noticeable difference in BMI between genders, with women generally having higher average BMI values than men. This belief is based on well-established biological differences in body composition between the sexes: women tend to have a higher percentage of body fat, while men often have more muscle mass. BMI is a function of weight and height only and does not distinguish fat from muscle, so it is not obvious in advance how these differences translate into BMI, which is why the direction and size of the difference need to be checked against the data. I expect the difference to be statistically significant based on prior research, though its magnitude will need to be confirmed through the data analysis.

5.2 Data Description

# Create the codebook table
data_description_bmi_gender <- data.frame(
  Variable_Name = c("BMI", "Gender"),
  Description = c("Body Mass Index, calculated as weight (kg) / height (m)^2",
                  "Gender of the participant (Male/Female)"),
  Type = c("Quantitative", "Binary"),
  Original_Variable_Name = c("BMXBMI", "RIAGENDR")
)

# View the data description table
data_description_bmi_gender
  Variable_Name                                               Description
1           BMI Body Mass Index, calculated as weight (kg) / height (m)^2
2        Gender                   Gender of the participant (Male/Female)
          Type Original_Variable_Name
1 Quantitative                 BMXBMI
2       Binary               RIAGENDR

The data summaries:

# Numeric summaries for BMI and Gender
summary_stats_bmi_gender <- BMI_and_Gender %>%
  summarise(
    BMI_Min = min(BMI, na.rm = TRUE),
    BMI_Max = max(BMI, na.rm = TRUE),
    BMI_Mean = mean(BMI, na.rm = TRUE),
    BMI_Median = median(BMI, na.rm = TRUE),
    BMI_SD = sd(BMI, na.rm = TRUE),
    Gender_Male_Percentage = mean(Gender == "Male", na.rm = TRUE) * 100,    # Gender was recoded from 1 to "Male"
    Gender_Female_Percentage = mean(Gender == "Female", na.rm = TRUE) * 100 # Gender was recoded from 2 to "Female"
  )

write.csv(BMI_and_Gender,"BMI_data.csv", row.names = FALSE)
# View the summary statistics
summary_stats_bmi_gender
# A tibble: 1 × 7
  BMI_Min BMI_Max BMI_Mean BMI_Median BMI_SD Gender_Male_Percentage
    <dbl>   <dbl>    <dbl>      <dbl>  <dbl>                  <dbl>
1    11.9    92.3     26.7       25.8   8.42                   49.2
# ℹ 1 more variable: Gender_Female_Percentage <dbl>

I used a boxplot to examine the data:

# Boxplot to compare BMI between genders
boxplot_bmi <- ggplot(BMI_and_Gender, aes(x = Gender, y = BMI, fill = Gender)) +
  geom_boxplot() +
  labs(title = "BMI Distribution by Gender", x = "Gender", y = "BMI") +
  theme_minimal()
boxplot_bmi

There appear to be a lot of outliers in the boxplot, so I have removed them to get a clearer view of the data.

# Remove outliers based on the IQR (Interquartile Range) method
Q1 <- quantile(BMI_and_Gender$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(BMI_and_Gender$BMI, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_limit <- Q1 - 1.5 * IQR_value
upper_limit <- Q3 + 1.5 * IQR_value

# Filter out the outliers
BMI_and_Gender_no_outliers <- BMI_and_Gender %>%
  filter(BMI >= lower_limit & BMI <= upper_limit)

# Boxplot without outliers
boxplot_bmi_no_outliers <- ggplot(BMI_and_Gender_no_outliers, aes(x = Gender, y = BMI, fill = Gender)) +
  geom_boxplot() +
  labs(title = "BMI Distribution by Gender (Without Outliers)", x = "Gender", y = "BMI") +
  theme_minimal()

# Show the plot
boxplot_bmi_no_outliers
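
The fences above are computed once from the pooled BMI values. A possible variation, shown here only as a sketch and not as the method used in this report, is to compute the IQR fences separately within each gender so that each group is judged against its own spread. The `toy` data frame and the helper `inside_fences()` are hypothetical.

```r
# Toy data (made up): two groups with different spreads
toy <- data.frame(
  Gender = rep(c("Male", "Female"), each = 6),
  BMI    = c(22, 24, 25, 26, 27, 60,   # 60 is extreme for this group
             20, 25, 30, 35, 40, 45)
)

# TRUE when a value lies between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  spread <- q[[2]] - q[[1]]
  x >= q[[1]] - 1.5 * spread & x <= q[[2]] + 1.5 * spread
}

# ave() applies the function within each Gender group and keeps row order
keep <- as.logical(ave(toy$BMI, toy$Gender, FUN = inside_fences))
toy_no_outliers <- toy[keep, ]
toy_no_outliers
```

Whether pooled or per-group fences are preferable depends on the question; in either case the filtered data are best used for plotting only, with the test run on the full sample.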

Females seem to have a higher BMI than males. I will proceed with the analysis to examine this relationship more closely.

The following code plots histograms so I can examine the data by gender.

histogram_bmi <- ggplot(BMI_and_Gender, aes(x = BMI, fill = Gender)) +
  geom_histogram(binwidth = 1, alpha = 0.6, position = "identity") +  # Adjust binwidth as needed
  facet_wrap(~ Gender, scales = "free_y") +  # Facet by Gender
  labs(title = "BMI Distribution by Gender", x = "BMI", y = "Frequency") +
  theme_minimal()

histogram_bmi

The following is the Q-Q plot of BMI.

ggplot(BMI_and_Gender, aes(sample = BMI)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_wrap(~ Gender, scales = "free_y") +
  labs(title = "Q-Q Plot of  BMI by Gender", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

The data seems skewed to the right, so I will plot log(BMI + 1):

histogram_bmi <- ggplot(BMI_and_Gender, aes(x = log(BMI + 1), fill = Gender)) +
  geom_histogram(binwidth = 0.1, alpha = 0.6, position = "identity") +  # Adjust binwidth as needed
  facet_wrap(~ Gender, scales = "free_y") +  # Facet by Gender
  labs(title = "Log-Transformed BMI Distribution by Gender", x = "log(BMI + 1)", y = "Frequency") +
  theme_minimal()

histogram_bmi

The following is the Q-Q plot of log(BMI + 1).

BMI_and_Gender$BMI_log <- log(BMI_and_Gender$BMI +1)
ggplot(BMI_and_Gender, aes(sample = BMI_log)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_wrap(~ Gender, scales = "free_y") +
  labs(title = "Q-Q Plot of Log-Transformed BMI by Gender", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

Although there appears to be a slight skew in the data, it generally looks appropriate to proceed with the analysis.

5.3 Main analysis

Hypotheses:

Null Hypothesis (H₀): There is no difference in mean BMI between genders (μ_male = μ_female).

Alternative Hypothesis (H₁): There is a difference in mean BMI between genders (μ_male ≠ μ_female).

Choice of Statistical Test: The decision to use an independent two-sample t-test was based on the following factors:

The research question aims to compare the means of BMI between two independent groups: males and females.

The data is continuous (BMI values) and, after the log transformation, approximately normally distributed within each gender group, which justifies the use of a t-test. Please refer to the dashboard to examine the histograms.

The two-sample t-test is appropriate for comparing the means of two groups when the assumption of normality is met and the data from each group are independent of the other. R's t.test() runs the Welch version by default, which does not assume equal variances in the two groups.

t_test_result <- t.test(BMI_log ~ Gender, data = BMI_and_Gender)
t_test_result

    Welch Two Sample t-test

data:  BMI_log by Gender
t = -6.2136, df = 13069, p-value = 5.335e-10
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
 -0.04199041 -0.02185101
sample estimates:
  mean in group Male mean in group Female 
            3.259754             3.291675 

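The following code back-transforms the results to make them easier to interpret. Because the test was run on log(BMI + 1), exponentiating the confidence limits gives a male-to-female ratio, and exponentiating the group means gives geometric means on the BMI + 1 scale: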

lower_bound_backtransformed <- exp(-0.04199041)
upper_bound_backtransformed <- exp(-0.02185101)

# Print the back-transformed confidence interval
lower_bound_backtransformed
[1] 0.958879
upper_bound_backtransformed
[1] 0.978386
mean_male_backtransformed <- exp(3.259754)
mean_female_backtransformed <- exp(3.291675)
mean_male_backtransformed
[1] 26.04313
mean_female_backtransformed
[1] 26.88786

5.4 Conclusion

Based on the results of the Welch Two Sample t-test, there is a statistically significant difference in mean log-transformed BMI between males and females (t = -6.2136, df = 13069, p-value = 5.335e-10). The 95% confidence interval for the difference in means on the log scale (male minus female) ranges from -0.04199041 to -0.02185101. Back-transformed, this corresponds to a male-to-female ratio of geometric means between 0.958879 and 0.978386, so the typical male value is roughly 2% to 4% lower than the typical female value.

The back-transformed geometric mean for females is 26.88786, while the geometric mean for males is 26.04313. Because the transformation was log(BMI + 1), these values are on the BMI + 1 scale; subtracting 1 gives geometric mean BMIs of about 25.89 and 25.04. Either way, females have a higher BMI than males on average in this dataset. The very small p-value (much less than 0.05) strongly supports the rejection of the null hypothesis that there is no difference in BMI between genders.
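The back-transformation can also be read directly from the test object instead of retyping the printed numbers. The sketch below is only an illustration: the data frame `demo` is simulated (it is not the NHANES data), and the object names are hypothetical.

```r
# Simulated stand-in data (hypothetical; not the NHANES values)
set.seed(1)
demo <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 200),
                  levels = c("Male", "Female")),
  BMI = c(rlnorm(200, meanlog = log(26), sdlog = 0.2),
          rlnorm(200, meanlog = log(27), sdlog = 0.2))
)
demo$BMI_log <- log(demo$BMI + 1)

# Welch two-sample t-test (the default in t.test)
fit <- t.test(BMI_log ~ Gender, data = demo)

# Difference in log means (Male - Female), back-transformed to a ratio
ratio <- exp(unname(fit$estimate[1] - fit$estimate[2]))
ratio_ci <- exp(fit$conf.int)

# Geometric means on the BMI scale: exponentiate, then undo the "+ 1"
geo_means <- exp(fit$estimate) - 1

ratio
ratio_ci
geo_means
```

Reading the values from `fit` keeps the reported interval in sync with the test if the data or the transformation change.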

Next Steps: Given the significant difference in BMI between genders, it may be worthwhile to further investigate potential factors that contribute to this disparity, such as age, lifestyle, or underlying health conditions. Additionally, exploring the data with different variables or additional stratifications could provide more insights into the reasons behind the observed differences.

5.5 Reasoning for Dashboard

Mean BMI for Women and Men: The value boxes for the mean BMI of women and men were chosen to provide a quick summary of the average BMI for each gender group in the dataset. This allows users to easily compare the overall BMI levels between genders. The icons representing “gender-female” and “gender-male” along with the color choices (pink for women and blue for men) offer a clear visual distinction between the two groups. The rounding of the BMI values to two decimal places ensures the information is concise and easy to read while maintaining sufficient precision for comparison.

By presenting the mean BMI separately for women and men, this visualization immediately communicates the central tendency for both groups, which is useful for understanding gender-related health trends in the dataset. It is also important to highlight that the mean BMI can be used as a basic indicator of population health, and any deviations or comparisons between groups can help identify potential health issues or areas for further analysis.

Violin Plot: The addition of a violin plot was chosen to provide a more detailed visual representation of the distribution of BMI values for both women and men. Unlike the boxplot, which only shows summary statistics (e.g., quartiles, median), the violin plot also depicts the density of the BMI distribution across genders, offering a better understanding of the data spread and potential skewness.

The violin plot is particularly useful for comparing the shape of the distribution between women and men, showing whether there are any differences in the variability, central tendency, or the presence of outliers in the BMI values across the two gender groups. This visualization complements the mean BMI values by giving a deeper look into the data distribution.
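A violin plot of this kind could be built along the following lines. This is a minimal sketch, assuming ggplot2 is installed; the data are simulated, and the colours simply mirror the pink and blue used for the value boxes. It is not the dashboard's actual code.

```r
library(ggplot2)

# Simulated stand-in data (hypothetical; not the NHANES values)
set.seed(6)
demo <- data.frame(
  Gender = factor(rep(c("Female", "Male"), each = 200)),
  BMI = c(rlnorm(200, meanlog = log(27), sdlog = 0.2),
          rlnorm(200, meanlog = log(26), sdlog = 0.2))
)

violin_bmi <- ggplot(demo, aes(x = Gender, y = BMI, fill = Gender)) +
  geom_violin(alpha = 0.6, trim = FALSE) +                         # density shape per gender
  geom_boxplot(width = 0.1, fill = "white", outlier.shape = NA) +  # median and quartiles inside
  scale_fill_manual(values = c(Female = "pink", Male = "lightblue")) +
  labs(title = "BMI Distribution by Gender", x = "Gender", y = "BMI") +
  theme_minimal()

violin_bmi
```

Overlaying a narrow boxplot keeps the median and quartiles visible while the violin shows the overall shape.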

6 Cholesterol and Age Groups

In this analysis, I aim to explore the relationship between cholesterol levels and age groups. Specifically, I am interested in examining whether there are significant differences in cholesterol levels across various age categories. This could provide insight into how age influences cholesterol, which is a known risk factor for cardiovascular diseases.

Research Question: Does cholesterol level differ significantly across different age groups?

Pre-existing Belief: I believe that as age increases, cholesterol levels tend to rise, as is commonly observed in epidemiological studies. It is often hypothesized that older individuals may have higher cholesterol levels due to metabolic changes that occur with aging. Therefore, I expect to see higher average cholesterol levels in older age groups compared to younger ones. However, this assumption needs to be verified through data analysis.

6.1 Data Description

# Create the codebook table for Cholesterol and Age Groups
data_description_cholesterol_age <- data.frame(
  Variable_Name = c("Cholesterol", "Age_Group"),
  Description = c("Total cholesterol level (mg/dL)",
                  "Age categorized into groups (e.g., 18-29, 30-39, etc.)"),
  Type = c("Quantitative", "Categorical"),
  Original_Variable_Name = c("LBXTC", "RIDAGEYR")
)

# View the data description table
data_description_cholesterol_age
  Variable_Name                                            Description
1   Cholesterol                        Total cholesterol level (mg/dL)
2     Age_Group Age categorized into groups (e.g., 18-29, 30-39, etc.)
          Type Original_Variable_Name
1 Quantitative                  LBXTC
2  Categorical               RIDAGEYR

# Numeric summaries for Cholesterol and Age Group
summary_stats_cholesterol_age <- Cholesterol_AgeGroups %>%
  group_by(Age_Group)%>%
  summarise(
    Cholesterol_Min = min(Total_Cholesterol, na.rm = TRUE),
    Cholesterol_Max = max(Total_Cholesterol, na.rm = TRUE),
    Cholesterol_Mean = mean(Total_Cholesterol, na.rm = TRUE),
    Cholesterol_Median = median(Total_Cholesterol, na.rm = TRUE),
    Cholesterol_SD = sd(Total_Cholesterol, na.rm = TRUE)
  )

# View the summary statistics for Cholesterol and Age Group
summary_stats_cholesterol_age
# A tibble: 8 × 6
  Age_Group Cholesterol_Min Cholesterol_Max Cholesterol_Mean Cholesterol_Median
  <fct>               <dbl>           <dbl>            <dbl>              <dbl>
1 <20                    73             322             155.                153
2 20-29                  84             416             171.                166
3 30-39                  87             384             184.                180
4 40-49                  84             431             193.                191
5 50-59                  94             446             198.                195
6 60-69                  76             365             188.                185
7 70-79                  71             428             180.                175
8 80+                    80             315             176.                172
# ℹ 1 more variable: Cholesterol_SD <dbl>

write.csv(Cholesterol_AgeGroups,"CholesterolLevel_df.csv", row.names = FALSE)

# Boxplot of cholesterol by age group
ggplot(Cholesterol_AgeGroups, aes(x = Age_Group, y = Total_Cholesterol)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.shape = 16, outlier.colour = "red") +
  theme_minimal() +
  labs(title = "Cholesterol Levels by Age Group", x = "Age Group", y = "Cholesterol Level")

It does seem that cholesterol levels change across age groups. The data suggests that cholesterol levels peak in middle age and then decrease afterwards.

The following code plots a histogram of log-transformed cholesterol for each age group to examine the distribution.

ggplot(Cholesterol_AgeGroups, aes(x = log(Total_Cholesterol))) +
  geom_histogram(binwidth = 0.1, color = "black", fill = "lightblue", alpha = 0.7) +
  facet_wrap(~ Age_Group, scales = "free_y") +  # Creates separate plots for each age group
  theme_minimal() +
  labs(title = "Cholesterol Levels by Age Group", x = "log(Cholesterol Level)", y = "Frequency") +
  theme(strip.text = element_text(size = 10),  # Adjusts the size of the age group labels
        axis.text.x = element_text(angle = 45, hjust = 1))  # Rotates x-axis labels for readability

There seems to be some skewness in the data, so I will examine it further using a Q-Q plot.

ggplot(Cholesterol_AgeGroups, aes(sample = Total_Cholesterol)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ Age_Group) +  # Creates separate QQ plots for each age group
  theme_minimal() +
  labs(title = "QQ Plots of Cholesterol Levels by Age Group", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme(strip.text = element_text(size = 10),  # Adjusts the size of the age group labels
        axis.text.x = element_text(angle = 45, hjust = 1))  # Optional: Rotates x-axis labels for readability

Although the raw data appear to be reasonably close to normal, the log-transformed data (histogram and Q-Q plot below) follow a normal distribution more closely, so I will proceed with the log-transformed data for the analysis.

ggplot(Cholesterol_AgeGroups, aes(x = log(Total_Cholesterol))) +
  geom_histogram(binwidth = 0.1, color = "black", fill = "lightblue", alpha = 0.7) +
  facet_wrap(~ Age_Group, scales = "free_y") +  # Creates separate plots for each age group
  theme_minimal() +
  labs(title = "Log-transformed Cholesterol Levels by Age Group", x = "log(Cholesterol Level)", y = "Frequency") +
  theme(strip.text = element_text(size = 10),  # Adjusts the size of the age group labels
        axis.text.x = element_text(angle = 45, hjust = 1))  # Rotates x-axis labels for readability

ggplot(Cholesterol_AgeGroups, aes(sample = log(Total_Cholesterol))) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ Age_Group) +  # Creates separate QQ plots for each age group
  theme_minimal() +
  labs(title = "QQ Plots of Log-Transformed Cholesterol Levels by Age Group", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme(strip.text = element_text(size = 10),  # Adjusts the size of the age group labels
        axis.text.x = element_text(angle = 45, hjust = 1))  # Optional: Rotates x-axis labels for readability

6.2 Main analysis

Hypotheses: Null hypothesis (H₀): There is no significant difference in the mean cholesterol levels across the different age groups.

Alternative hypothesis (H₁): At least one age group has a significantly different mean cholesterol level compared to the others.

Choice of Statistical Test:

I chose a one-way ANOVA (Analysis of Variance) to compare the cholesterol levels across multiple age groups. This test is appropriate because we are comparing the means of cholesterol levels across more than two groups (age groups), and the data appears to meet the assumptions of ANOVA. These assumptions include:

Independence: The cholesterol levels for each age group are independent of each other.

Normality: The distribution of cholesterol levels within each age group is approximately normal.

Homogeneity of variances: The variance in cholesterol levels is roughly equal across age groups.

Before conducting the ANOVA, I checked for outliers and ensured that the groups have approximately equal variances. If there were violations of normality or homogeneity, alternative tests or transformations would be considered.
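The report does not show how the equal-variance check was carried out. One possible way to do it, sketched here on simulated data with hypothetical group labels, is Bartlett's test from base R, with Welch's one-way test as a fallback that drops the equal-variance assumption.

```r
# Simulated stand-in data (hypothetical; three age groups only)
set.seed(2)
demo <- data.frame(
  Age_Group = factor(rep(c("20-29", "40-49", "60-69"), each = 100)),
  Total_Cholesterol = c(rlnorm(100, log(170), 0.2),
                        rlnorm(100, log(190), 0.2),
                        rlnorm(100, log(185), 0.2))
)

# Standard deviation of log cholesterol within each group
group_sd <- tapply(log(demo$Total_Cholesterol), demo$Age_Group, sd)
group_sd

# Bartlett's test: the null hypothesis is that all group variances are equal
var_check <- bartlett.test(log(Total_Cholesterol) ~ Age_Group, data = demo)
var_check

# Welch's one-way test does not assume equal variances
welch_check <- oneway.test(log(Total_Cholesterol) ~ Age_Group,
                           data = demo, var.equal = FALSE)
welch_check
```

Bartlett's test is sensitive to non-normality, so comparing the group standard deviations directly is a useful sanity check alongside it.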

# One-way ANOVA on log-transformed cholesterol across age groups
anova_result <- aov(log(Total_Cholesterol) ~ Age_Group, data = Cholesterol_AgeGroups)

# Summarize the ANOVA table (a Tukey HSD post-hoc test would be the follow-up if significant)
summary(anova_result)
               Df Sum Sq Mean Sq F value Pr(>F)    
Age_Group       7   79.8  11.395   266.7 <2e-16 ***
Residuals   10820  462.3   0.043                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.3 Conclusion

The results of the one-way ANOVA reveal a highly significant difference in cholesterol levels across the different age groups, with an F-value of 266.7 and a p-value of less than 2e-16. This allows us to reject the null hypothesis and conclude that there is a statistically significant difference in cholesterol levels between at least some of the age groups.
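The ANOVA only says that at least one group differs; it does not say which ones. A Tukey HSD post-hoc comparison is the usual follow-up. The sketch below runs it on simulated data with three hypothetical age groups; it is not output from the NHANES analysis.

```r
# Simulated stand-in data (hypothetical; three age groups only)
set.seed(3)
demo <- data.frame(
  Age_Group = factor(rep(c("20-29", "40-49", "60-69"), each = 100)),
  Total_Cholesterol = c(rlnorm(100, log(170), 0.2),
                        rlnorm(100, log(190), 0.2),
                        rlnorm(100, log(185), 0.2))
)

fit <- aov(log(Total_Cholesterol) ~ Age_Group, data = demo)

# All pairwise comparisons with family-wise 95% confidence intervals
tukey <- TukeyHSD(fit, "Age_Group")
tukey

# Differences are on the log scale; exponentiating gives ratios of geometric means
tukey_ratios <- exp(tukey$Age_Group[, c("diff", "lwr", "upr")])
tukey_ratios
```

A ratio interval that excludes 1 indicates a pair of age groups whose geometric mean cholesterol differs.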

Reflection on Pre-existing Belief:

Before analyzing the data, I expected cholesterol levels to rise with age and to be highest in the oldest groups. The significant ANOVA result confirms that age group matters, but the group means only partly match that expectation: mean cholesterol rises from about 155 mg/dL in the under-20 group to a peak of about 198 mg/dL in the 50-59 group, then declines to about 176 mg/dL in the 80+ group.

6.4 Reasoning for Dashboard

KPIs:

Mean Cholesterol Level (mg/dL):

Why this KPI?: The mean cholesterol level provides a clear, high-level summary of the overall cholesterol status of the population under study. By calculating the mean cholesterol level, you can get an idea of the average cholesterol level across all participants.

Result (184.41 mg/dL): This value reflects the general cholesterol level within the study population. A mean value of 184.41 mg/dL indicates a moderate level of cholesterol, which can serve as a baseline for further analysis, such as comparing cholesterol levels across age groups or identifying trends over time.

Participants with High Cholesterol (>240 mg/dL):

Why this KPI?: Cholesterol levels above 240 mg/dL are considered high and are associated with increased cardiovascular risk. This KPI helps track how many individuals are at risk due to elevated cholesterol levels.

Result (732 participants): This value shows the number of participants whose cholesterol levels are above 240 mg/dL, allowing us to gauge the prevalence of high cholesterol in the study population. It also helps in understanding the potential burden of health conditions related to high cholesterol.

Visualizations:

Violin Plot of All Ages to Compare Cholesterol Distribution:

Why this plot?: The violin plot is effective in showing the distribution of cholesterol levels across different age groups. It provides an overview of the spread, central tendency (median), and potential outliers of cholesterol levels for the entire dataset.

Insights: The plot can visually highlight trends in cholesterol levels across different ages and detect patterns such as whether cholesterol levels tend to increase with age, or if there are any age groups with especially high or low values.

Interactive Histogram for Each Age Group:

Why this plot?: The interactive histogram allows users to drill down into cholesterol distributions within specific age groups. By making it interactive, users can filter or zoom into particular age ranges for a more detailed analysis.

Insights: This histogram helps to analyze the cholesterol levels in different age groups separately. It can show how the cholesterol distribution changes with age, identify outliers, and help track whether certain age groups have a higher concentration of participants with elevated cholesterol levels.

Overall Reasoning: By combining these KPIs and visualizations, the dashboard provides both high-level insights (mean cholesterol levels and prevalence of high cholesterol) and detailed, interactive exploration (age-specific distributions). This approach allows users to quickly grasp the general trends while also enabling deeper exploration of the data based on specific age groups. I decided to use the raw data for the dashboard to make the information more accessible and easier to interpret for a wider audience, while keeping the log-transformed data for the more detailed statistical analysis in this report.
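The two KPIs are simple summaries of the raw cholesterol column. A minimal sketch of how they could be computed is shown below; the data are simulated and the object names (`kpi_mean`, `kpi_high`, `kpi_high_pct`) are hypothetical, so the numbers will not match the dashboard values.

```r
# Simulated stand-in data (hypothetical; not the NHANES values)
set.seed(4)
demo <- data.frame(Total_Cholesterol = round(rlnorm(500, log(185), 0.2)))

# KPI 1: mean total cholesterol, rounded for display
kpi_mean <- round(mean(demo$Total_Cholesterol, na.rm = TRUE), 2)

# KPI 2: number of participants above the 240 mg/dL threshold
kpi_high <- sum(demo$Total_Cholesterol > 240, na.rm = TRUE)

# The same KPI as a percentage, which is easier to compare across samples
kpi_high_pct <- round(100 * mean(demo$Total_Cholesterol > 240, na.rm = TRUE), 1)

c(mean = kpi_mean, n_high = kpi_high, pct_high = kpi_high_pct)
```

Reporting the percentage next to the count helps readers judge prevalence without knowing the sample size.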

This will support further analysis, such as understanding how age influences cholesterol levels and identifying key risk groups for targeted interventions or health advice.

7 Smoking and Hypertension

7.1 The Question

I want to study the relationship between smoking status and hypertension status in a population. Specifically, I aim to understand if there is an association between smoking and the prevalence of hypertension.

Research Question: Is there a significant association between smoking status and hypertension status in the study population?

Pre-existing Belief: Based on existing research and common medical knowledge, I hypothesize that smokers are more likely to have hypertension compared to non-smokers. This belief stems from the understanding that smoking is a known risk factor for the development of various cardiovascular conditions, including hypertension. Therefore, I expect to find a higher percentage of smokers among the participants with hypertension.

7.2 Data Description

The following is a description of the variables we use for the analysis.

# Create the codebook table for Smoking and Hypertension
data_description_smoking_hypertension <- data.frame(
  Variable_Name = c("Smoking_Status", "Hypertension_Status"),
  Description = c("Smoking status of the participant (Smoker/Non-Smoker)",
                  "Hypertension status of the participant (Yes/No)"),
  Type = c("Categorical", "Categorical"),
  Original_Variable_Name = c("SMQ020", "BPQ020")
)

# View the data description table
data_description_smoking_hypertension
        Variable_Name                                           Description
1      Smoking_Status Smoking status of the participant (Smoker/Non-Smoker)
2 Hypertension_Status       Hypertension status of the participant (Yes/No)
         Type Original_Variable_Name
1 Categorical                 SMQ020
2 Categorical                 BPQ020

The following are summaries of the data.

# Numeric summaries for Smoking and Hypertension
summary_stats_smoking_hypertension <- Smoking_Hypertension %>%
  group_by(Smoking_Status, Hypertension_Status) %>%
  summarise(
    Count = n(),
    Percentage = n() / nrow(Smoking_Hypertension) * 100,
    .groups = "drop"  # This removes the grouping after summarizing
  )

# View the summary statistics for Smoking and Hypertension
summary_stats_smoking_hypertension
# A tibble: 4 × 4
  Smoking_Status Hypertension_Status Count Percentage
  <fct>          <fct>               <int>      <dbl>
1 Smoker         Yes                  1717       17.7
2 Smoker         No                   2165       22.4
3 Non-Smoker     Yes                  1860       19.2
4 Non-Smoker     No                   3934       40.7

write.csv(Smoking_Hypertension, "smoking_hypertension_analysis_data.csv", row.names = FALSE)

The following are the plots to examine the distribution of the data.

# Bar plot for Smoking_Status
ggplot(Smoking_Hypertension, aes(x = Smoking_Status)) + 
  geom_bar() + 
  ggtitle("Smoking Status Distribution")

# Bar plot for Hypertension_Status
ggplot(Smoking_Hypertension, aes(x = Hypertension_Status)) + 
  geom_bar() + 
  ggtitle("Hypertension Status Distribution")

7.3 Main Analysis

In this analysis, we aim to investigate the relationship between smoking status and hypertension (high blood pressure) using data on participants’ smoking habits and their hypertension status. Our goal is to determine whether there is a statistically significant association between smoking and the likelihood of having hypertension. To do this, we will create a contingency table, perform a Chi-squared test, and calculate several metrics that describe the strength and direction of the association: Relative Risk (RR), Odds Ratio (OR), and Risk Difference (RD).

Step 1: Create the Contingency Table

The first step in this analysis is to create a contingency table that summarizes the counts of individuals categorized by their smoking status and their hypertension status. This table has two key categorical variables:

Smoking_Status: Smoker or Non-Smoker

Hypertension_Status: Yes or No

The contingency table organizes the data so that we can compare the frequencies of the different combinations of these two variables. Please note that the Yes/No columns refer to hypertension status.

# Step 1: Create the contingency table for smoking status vs hypertension status
contingency_table <- Smoking_Hypertension %>%
  count(Smoking_Status, Hypertension_Status) %>%
  pivot_wider(names_from = Hypertension_Status, values_from = n, values_fill = list(n = 0))

# Print the contingency table
contingency_table
# A tibble: 2 × 3
  Smoking_Status   Yes    No
  <fct>          <int> <int>
1 Smoker          1717  2165
2 Non-Smoker      1860  3934

Step 2: Chi-Squared Test

Once we have the contingency table, we perform a Chi-squared test to assess if there is a significant association between smoking and hypertension. The Chi-squared test is a statistical method used to determine whether two categorical variables are independent or related.

chi_squared_test <- chisq.test(contingency_table[, -1])
chi_squared_test

    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency_table[, -1]
X-squared = 146.2, df = 1, p-value < 2.2e-16

Here, we exclude the first column (Smoking_Status) from the contingency table, because the test requires only the counts of hypertension statuses (Yes and No) for each smoking category. The results of the chi-square test show a significant association between the two variables, with a test statistic of X-squared = 146.2, degrees of freedom (df) = 1, and a p-value less than 2.2e-16, indicating that an association this strong would be highly unlikely if smoking and hypertension were independent.
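To see what the test compares, the sketch below rebuilds the 2 x 2 table from the counts printed above and inspects the expected counts under independence. It also runs the test without Yates' continuity correction, which R applies by default to 2 x 2 tables. The object names are hypothetical.

```r
# Counts copied from the contingency table above
counts <- matrix(c(1717, 2165,
                   1860, 3934),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Smoking_Status = c("Smoker", "Non-Smoker"),
                                 Hypertension_Status = c("Yes", "No")))

with_yates <- chisq.test(counts)                 # default for 2 x 2 tables
no_yates <- chisq.test(counts, correct = FALSE)  # plain Pearson statistic

# Counts expected in each cell if smoking and hypertension were independent
with_yates$expected

# The correction shrinks the statistic slightly; with a sample this large it barely matters
c(with_yates = unname(with_yates$statistic),
  no_yates = unname(no_yates$statistic))
```

All expected counts are far above 5, so the chi-squared approximation is reasonable here.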

After performing the Chi-squared test, we will calculate three additional metrics: Relative Risk (RR), Odds Ratio (OR), and Risk Difference (RD). These metrics help us understand the strength and direction of the association between smoking and hypertension.

# Calculate Relative Risk (RR), Odds Ratio (OR), and Risk Difference (RD)
# Row-wise proportions: each row of the contingency table is one smoking group
proportions <- contingency_table %>%
  mutate(
    total = Yes + No,
    smoker_yes = Yes / total,  # Proportion hypertensive within the row's smoking group
    smoker_no = No / total    # Proportion non-hypertensive within the row's smoking group
  )

# Calculate Relative Risk (RR)
relative_risk <- (contingency_table$Yes[1] / sum(contingency_table[1, c("Yes", "No")])) / 
                 (contingency_table$Yes[2] / sum(contingency_table[2, c("Yes", "No")]))

# Calculate Odds Ratio (OR)
odds_ratio <- (contingency_table$Yes[1] / contingency_table$No[1]) / 
              (contingency_table$Yes[2] / contingency_table$No[2])

# Calculate Risk Difference (RD)
risk_difference <- proportions$smoker_yes[1] - proportions$smoker_yes[2]

# Save the results in a data frame
results <- data.frame(
  Metric = c("Relative Risk (RR)", "Odds Ratio (OR)", "Risk Difference (RD)"),
  Value = c(round(relative_risk, 2), round(odds_ratio, 2), round(risk_difference, 2))
)

write.csv(results, "smoking_hypertension_analysis_results.csv", row.names = FALSE)

# Print the results
results
                Metric Value
1   Relative Risk (RR)  1.38
2      Odds Ratio (OR)  1.68
3 Risk Difference (RD)  0.12

7.4 Conclusion

Based on the analysis of the relationship between smoking status and hypertension, we can draw the following conclusions:

Research Question: The initial research question was to determine whether there is a significant association between smoking and the likelihood of having hypertension.

Chi-Squared Test Results: The Chi-squared test revealed a highly significant result with a p-value of less than 2.2e-16, which provides very strong evidence of an association between smoking and hypertension. This suggests that smoking status and hypertension are not independent in this sample.

Relative Risk (RR): The Relative Risk (RR) is 1.38, which indicates that smokers are 1.38 times as likely to have hypertension as non-smokers. This suggests a moderately higher prevalence of hypertension among smokers.

Odds Ratio (OR): The Odds Ratio (OR) is 1.68, which means that the odds of hypertension in smokers are 1.68 times the odds of hypertension in non-smokers. This is consistent with the relative risk and points in the same direction.

Risk Difference (RD): The Risk Difference (RD) is 0.12, meaning that the prevalence of hypertension is 12 percentage points higher among smokers than among non-smokers (about 44.2% versus 32.1%). This gives us a clear sense of the absolute difference in hypertension prevalence between the two groups.
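The point estimates above come without a measure of uncertainty. The sketch below adds approximate 95% confidence intervals using the standard Wald (large-sample) formulas, computed from the counts in the contingency table. The variable names are hypothetical, and the choice of Wald intervals is an assumption; other interval methods exist.

```r
# Counts copied from the contingency table above
sm_yes <- 1717; sm_no <- 2165   # smokers with / without hypertension
ns_yes <- 1860; ns_no <- 3934   # non-smokers with / without hypertension

risk_sm <- sm_yes / (sm_yes + sm_no)
risk_ns <- ns_yes / (ns_yes + ns_no)

rr <- risk_sm / risk_ns
or <- (sm_yes / sm_no) / (ns_yes / ns_no)
rd <- risk_sm - risk_ns

# Standard errors: RR and OR on the log scale, RD on the original scale
se_log_rr <- sqrt(1 / sm_yes - 1 / (sm_yes + sm_no) +
                  1 / ns_yes - 1 / (ns_yes + ns_no))
se_log_or <- sqrt(1 / sm_yes + 1 / sm_no + 1 / ns_yes + 1 / ns_no)
se_rd <- sqrt(risk_sm * (1 - risk_sm) / (sm_yes + sm_no) +
              risk_ns * (1 - risk_ns) / (ns_yes + ns_no))

z <- qnorm(0.975)  # critical value for a 95% interval

ci <- data.frame(
  Metric = c("Relative Risk (RR)", "Odds Ratio (OR)", "Risk Difference (RD)"),
  Estimate = c(rr, or, rd),
  Lower = c(exp(log(rr) - z * se_log_rr), exp(log(or) - z * se_log_or), rd - z * se_rd),
  Upper = c(exp(log(rr) + z * se_log_rr), exp(log(or) + z * se_log_or), rd + z * se_rd)
)
ci
```

For RR and OR, an interval that excludes 1 is consistent with the significant chi-squared result; for RD, the reference value is 0.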

Implications: The analysis strongly supports the hypothesis that smoking is associated with a higher risk of hypertension. These findings suggest that smoking may be an important modifiable risk factor for hypertension, which is a known contributor to various cardiovascular diseases. The moderate effect sizes (RR = 1.38, OR = 1.68) suggest that while smoking is a risk factor for hypertension, it is not the only factor, and other lifestyle or genetic factors may also contribute to the development of hypertension.

Logical Next Steps: A logical next step in this analysis could be to explore the relationship between smoking, hypertension, and other potential confounders such as age, gender, physical activity, or diet. A multivariate analysis could help control for these factors and better isolate the effect of smoking on hypertension risk. Additionally, investigating the interaction between smoking and other health conditions, like diabetes or obesity, could provide further insights into the complexity of this relationship.
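One way to adjust for a confounder such as age is logistic regression. The sketch below is purely illustrative: the data are simulated, the variable names are hypothetical, and the coefficients used to generate the outcome are arbitrary. It shows the mechanics only, not a result from this study.

```r
# Simulated stand-in data (hypothetical; not the NHANES values)
set.seed(5)
n <- 1000
demo <- data.frame(
  age = round(runif(n, 20, 80)),
  smoker = rbinom(n, 1, 0.4)
)

# Simulated outcome: hypertension probability rises with age and with smoking
demo$hypertension <- rbinom(n, 1, plogis(-4 + 0.05 * demo$age + 0.4 * demo$smoker))

# Logistic regression of hypertension on smoking, adjusted for age
fit <- glm(hypertension ~ smoker + age, data = demo, family = binomial)

# Exponentiated coefficients are adjusted odds ratios
adj_or <- exp(coef(fit))
adj_or_ci <- exp(confint.default(fit))  # Wald 95% intervals

adj_or
adj_or_ci
```

Comparing the adjusted odds ratio for `smoker` with the unadjusted one shows how much of the crude association a covariate accounts for.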

Reflection on Pre-Existing Beliefs:

Before conducting this analysis, I expected that smoking would be linked to a higher risk of hypertension, as smoking is widely known to affect cardiovascular health. The results confirm this belief with strong statistical evidence. However, the analysis also highlights the importance of considering other potential contributing factors and suggests that the risk associated with smoking, while significant, is part of a broader health context that requires further exploration.

7.5 Reasoning for Dashboard

The goal of the dashboard is to visually present the key findings of the analysis regarding the association between smoking and hypertension, using key performance indicators (KPIs) and intuitive visualizations to support the data interpretation.

KPIs:

Odds Ratio (OR) = 1.68: This indicates that the odds of hypertension among smokers are 1.68 times the odds among non-smokers. This metric is central to understanding the strength of the association between smoking and hypertension.

Risk Difference (RD) = 12%: This indicates that the prevalence of hypertension is 12 percentage points higher among smokers than among non-smokers. The RD is a straightforward measure of the absolute difference in the prevalence of hypertension between the two groups.

Visualizations:

Stacked Bar Plot for Smoking Status and Hypertension:

Purpose: To show the distribution of smokers and non-smokers in terms of their hypertension status. The plot will display the proportion of smokers and non-smokers who have hypertension (Yes) and who do not (No).

X-axis: Smoking status (Smoker vs. Non-Smoker)

Y-axis: Proportion of people in each group

Stacked Bars: Each bar will be divided into segments for hypertensive and non-hypertensive individuals, allowing for a clear comparison of the two smoking groups.

Pie Chart of Hypertensive and Non-Hypertensive Participants:

Purpose: To provide a high-level overview of the proportion of participants with and without hypertension in the entire sample.

Slices: The chart will display the proportion of participants who have hypertension versus those who do not, making it easy to see the overall distribution of hypertension in the dataset.

Pie Chart of Smokers and Non-Smokers:

Purpose: To visualize the proportion of smokers versus non-smokers in the dataset.

Slices: The chart will show the percentage of smokers and non-smokers, giving a sense of the prevalence of smoking in the population.

Justification for the Visualizations:

Stacked Bar Plot: This is particularly useful for comparing two categorical variables, such as smoking status and hypertension status. The stacked bars allow us to see the relationship between smoking and hypertension in one visualization. The visual emphasis on the proportion of hypertensive individuals within each smoking group makes it clear how the prevalence of hypertension differs between smokers and non-smokers.
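A proportional stacked bar plot of this kind could be drawn as follows. This is a minimal sketch, assuming ggplot2 is installed; it uses the four counts from the contingency table in section 7.3 and is not the dashboard's actual code.

```r
library(ggplot2)

# Counts copied from the contingency table in section 7.3
counts <- data.frame(
  Smoking_Status = factor(c("Smoker", "Smoker", "Non-Smoker", "Non-Smoker"),
                          levels = c("Smoker", "Non-Smoker")),
  Hypertension_Status = factor(c("Yes", "No", "Yes", "No"),
                               levels = c("Yes", "No")),
  Count = c(1717, 2165, 1860, 3934)
)

stacked_bar <- ggplot(counts, aes(x = Smoking_Status, y = Count,
                                  fill = Hypertension_Status)) +
  geom_col(position = "fill") +  # rescale each bar to proportions summing to 1
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  labs(title = "Hypertension Status by Smoking Status",
       x = "Smoking Status", y = "Proportion", fill = "Hypertension") +
  theme_minimal()

stacked_bar
```

Using `position = "fill"` puts both bars on the same 0 to 100% scale, so the difference in hypertension prevalence is visible even though the groups differ in size.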

8 Physical Activity and Cholesterol

8.1 My Research Question

Research Question: How does physical activity (measured as the number of days of vigorous recreational activities per week) predict cholesterol levels, adjusting for key factors such as BMI, age, smoking status, and hypertension?

Introduction and Background: Cholesterol levels, particularly low-density lipoprotein (LDL) and high-density lipoprotein (HDL) cholesterol, are significant markers of cardiovascular risk. Elevated cholesterol levels, especially LDL, are associated with increased risk of heart disease, stroke, and other cardiovascular issues. Regular physical activity is widely recommended as part of lifestyle modifications to reduce cholesterol levels and improve heart health.

Despite existing evidence supporting physical activity’s impact on cholesterol levels, there is variability in individual responses, which may depend on several other factors such as body mass index (BMI), age, smoking status, and hypertension. Understanding the interaction between physical activity and these factors is crucial for creating tailored health interventions.

Pre-analytic Hypothesis: I hypothesize that physical activity, particularly vigorous recreational activity (PAQ655), will have a negative association with cholesterol levels (lower cholesterol). I also anticipate that this relationship will be modified by other factors such as BMI, age, smoking, and hypertension. In particular, individuals with higher BMI may have less favorable cholesterol profiles despite higher levels of physical activity.

Partitioning the Data:

I will partition the data into two samples:

Training Sample: 70% of the data

Test Sample: 30% of the data

The training sample will be used to fit the model, and the test sample will be used for model validation.
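A 70/30 partition can be sketched in base R as follows. The data frame `demo` is simulated and the object names (`train_sample`, `test_sample`) are hypothetical; the report's own split may be implemented differently.

```r
# Simulated stand-in data (hypothetical; not the NHANES values)
set.seed(123)
demo <- data.frame(
  SEQN = 1:1000,
  Total_Cholesterol = round(rnorm(1000, mean = 185, sd = 40))
)

# Draw 70% of the row indices at random for the training sample
n <- nrow(demo)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))

train_sample <- demo[train_idx, ]   # used to fit the model
test_sample <- demo[-train_idx, ]   # held out for validation

c(train = nrow(train_sample), test = nrow(test_sample))
```

Setting the seed before sampling makes the partition reproducible, and indexing with `-train_idx` guarantees that no subject appears in both samples.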

The following code creates a tibble that includes all the variables.

original_data <- HDL_vs_LDL %>%
  inner_join(BMI_and_Gender, by = "SEQN") %>%
  inner_join(Cholesterol_AgeGroups, by = "SEQN") %>%
  inner_join(exercise_data_clean, by = "SEQN") %>%
  inner_join(Smoking_Hypertension, by = "SEQN")
original_data
# A tibble: 1,000 × 27
     SEQN   HDL   LDL   BMI Gender BMI_log Total_Cholesterol   Age Age_Group
    <dbl> <dbl> <dbl> <dbl> <fct>    <dbl>             <dbl> <dbl> <fct>    
 1 109313    54    81  25.2 Male      3.27               224    63 60-69    
 2 109326    85    70  21.6 Female    3.12               176    44 40-49    
 3 109327    58   160  23.7 Female    3.21               203    58 50-59    
 4 109371    60   170  29.3 Male      3.41               227    62 60-69    
 5 109403    35   120  27.1 Male      3.34               200    30 30-39    
 6 109426    52   103  27.9 Female    3.36               166    61 60-69    
 7 109441    63    44  18   Female    2.94               141    20 20-29    
 8 109444    64    36  21.7 Male      3.12               143    21 20-29    
 9 109454    32    45  50.9 Female    3.95               126    25 20-29    
10 109478    74    65  26.1 Female    3.30               192    27 20-29    
# ℹ 990 more rows
# ℹ 18 more variables: PAQ605 <dbl>, PAQ610 <dbl>, PAD615 <dbl>, PAQ620 <dbl>,
#   PAQ625 <dbl>, PAD630 <dbl>, PAQ635 <dbl>, PAQ640 <dbl>, PAD645 <dbl>,
#   PAQ650 <dbl>, vigorous_activity_days <int>, PAD660 <dbl>, PAQ665 <dbl>,
#   PAQ670 <dbl>, PAD675 <dbl>, PAD680 <dbl>, Smoking_Status <fct>,
#   Hypertension_Status <fct>

8.2 Data Description

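This code filters the data to keep only rows with complete data on the key variables (cholesterol, vigorous_activity_days, BMI, age group, and smoking status) and then keeps only those columns. Hypertension status is not included in this filter or in the selected columns.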

# Keep only rows with complete data on the key variables, then select those columns
set.seed(123) # You can pick a different seed

original_data <- original_data %>%
  filter(!is.na(Total_Cholesterol) & !is.na(vigorous_activity_days) & !is.na(BMI) & !is.na(Age_Group) & !is.na(Smoking_Status))

original_data_cleaned <- original_data %>%
  select(
    SEQN,                            # Subject ID (always necessary)
    Total_Cholesterol,               # Cholesterol levels
    vigorous_activity_days,           # Vigorous activity days
    BMI,                              # BMI
    Age_Group,                        # Age group
    Smoking_Status                    # Smoking status
  )

original_data_cleaned
# A tibble: 1,000 × 6
     SEQN Total_Cholesterol vigorous_activity_days   BMI Age_Group
    <dbl>             <dbl>                  <int> <dbl> <fct>    
 1 109313               224                      1  25.2 60-69    
 2 109326               176                      5  21.6 40-49    
 3 109327               203                      5  23.7 50-59    
 4 109371               227                      3  29.3 60-69    
 5 109403               200                      2  27.1 30-39    
 6 109426               166                      4  27.9 60-69    
 7 109441               141                      5  18   20-29    
 8 109444               143                      2  21.7 20-29    
 9 109454               126                      7  50.9 20-29    
10 109478               192                      3  26.1 20-29    
# ℹ 990 more rows
# ℹ 1 more variable: Smoking_Status <fct>
# Create a description table using reframe
table_description <- original_data_cleaned %>%
  reframe(
    Variable = c("Subject ID", "Total Cholesterol", "Vigorous Activity Days", "BMI", "Age Group", "Smoking Status"),
    Description = c(
      "Unique identifier for each subject",
      "Total cholesterol levels (mg/dL), outcome variable",
      "Number of days per week with vigorous physical activity, key predictor",
      "Body Mass Index (kg/m²)",
      "Age group categories (e.g., <20, 20-29, 30-39)",
      "Smoking status (e.g., smoker, non-smoker)"
    ),
    Highlight = c("", "Outcome", "Key Indicator", "", "", "")
  )

# Render table with highlights for Outcome and Key Indicator
datatable(table_description, 
          colnames = c("Variable", "Description", "Comments"),
          options = list(pageLength = 6)) %>%
  formatStyle(
    'Highlight',
    target = 'cell',
    backgroundColor = styleEqual(c('Outcome', 'Key Indicator'), c('yellow', 'lightblue'))
  )

The following code extracts relevant summary statistics to examine the data:

# Save the cleaned data and count the unique participants
write.csv(original_data_cleaned, "study2.csv", row.names = FALSE)
total_participants <- n_distinct(original_data_cleaned$SEQN)

# Summary statistics for numeric variables (BMI, Total Cholesterol, Vigorous Activity Days)

summary_stats <- original_data_cleaned %>%
  summarise(
    Mean_BMI = mean(BMI, na.rm = TRUE),
    SD_BMI = sd(BMI, na.rm = TRUE),
    Min_BMI = min(BMI, na.rm = TRUE),
    Max_BMI = max(BMI, na.rm = TRUE),
    
    Mean_Cholesterol = mean(Total_Cholesterol, na.rm = TRUE),
    SD_Cholesterol = sd(Total_Cholesterol, na.rm = TRUE),
    Min_Cholesterol = min(Total_Cholesterol, na.rm = TRUE),
    Max_Cholesterol = max(Total_Cholesterol, na.rm = TRUE),
    
    Mean_Vigorous_Activity_Days = mean(vigorous_activity_days, na.rm = TRUE),
    SD_Vigorous_Activity_Days = sd(vigorous_activity_days, na.rm = TRUE),
    Min_Vigorous_Activity_Days = min(vigorous_activity_days, na.rm = TRUE),
    Max_Vigorous_Activity_Days = max(vigorous_activity_days, na.rm = TRUE)
  )
summary_stats_long <- summary_stats %>%
  pivot_longer(
    everything(),
    names_to = "Statistic",
    values_to = "Value"
  )
# Summary of Categorical Variables (Age Group, Smoking Status)
categorical_summary <- original_data_cleaned %>%
  summarise(
    Age_Group_Counts = list(table(Age_Group)),
    Smoking_Status_Counts = list(table(Smoking_Status))
  )
# Create summary for categorical variables and transform them into data frames
age_group_counts <- as.data.frame(table(original_data_cleaned$Age_Group))
names(age_group_counts) <- c("Age_Group", "Count")

smoking_status_counts <- as.data.frame(table(original_data_cleaned$Smoking_Status))
names(smoking_status_counts) <- c("Smoking_Status", "Count")
original_data_filtered <- original_data_cleaned
# Now you can pass these data frames to plotly

Total number of Participants:

total_participants
[1] 1000

Summary Stats:

summary_stats_long
# A tibble: 12 × 2
   Statistic                    Value
   <chr>                        <dbl>
 1 Mean_BMI                     28.3 
 2 SD_BMI                        6.74
 3 Min_BMI                      15.5 
 4 Max_BMI                      82   
 5 Mean_Cholesterol            180.  
 6 SD_Cholesterol               39.9 
 7 Min_Cholesterol              76   
 8 Max_Cholesterol             370   
 9 Mean_Vigorous_Activity_Days   3.39
10 SD_Vigorous_Activity_Days     1.53
11 Min_Vigorous_Activity_Days    1   
12 Max_Vigorous_Activity_Days    7   
age_group_counts
  Age_Group Count
1       <20    97
2     20-29   259
3     30-39   188
4     40-49   175
5     50-59   136
6     60-69    92
7     70-79    39
8       80+    14

Note: All of the data were cleaned in an earlier section of this blog post, so I will proceed with the rest of the analysis.

# Descriptive statistics for continuous variables
summary_stats_continuous <- original_data_filtered %>%
  select(Total_Cholesterol, vigorous_activity_days, BMI) %>%
  summary()

# Descriptive statistics for categorical variables (Age_Group, smoking_status)
summary_stats_categorical <- original_data_filtered %>%
  select(Age_Group, Smoking_Status) %>%
  summarise(
    Age_Group_counts = list(table(Age_Group)),
    Smoking_status_counts = list(table(Smoking_Status))
  )

8.3 Partitioning the Data

The following code splits the data into a training set and a test set.

# Set a random seed to ensure reproducibility of the split
set.seed(123) 

training_sample <- original_data_filtered %>%
  slice_sample(prop = 0.70)

test_sample <- anti_join(original_data_filtered, training_sample, by = "SEQN")

# Verify that the split is correct
cat("Training Sample Size:", nrow(training_sample), "\n")
Training Sample Size: 700 
cat("Test Sample Size:", nrow(test_sample), "\n")
Test Sample Size: 300 
all_subjects <- union(training_sample$SEQN, test_sample$SEQN)

# Ensure that the number of unique subjects matches the original dataset
if (length(all_subjects) == nrow(original_data_filtered)) {
  cat("All subjects are accounted for in the training and test samples.\n")
} else {
  cat("Some subjects are missing or duplicated in the training and test samples.\n")
}
All subjects are accounted for in the training and test samples.
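
The union check above confirms that every subject appears somewhere. A complementary check is that no subject appears in both samples. The following is a minimal sketch on a toy data frame (the `toy_*` names and IDs are hypothetical, not NHANES data), using base R sampling rather than `slice_sample()`:

```r
# Toy stand-in for original_data_filtered (hypothetical IDs, not NHANES data)
set.seed(123)
toy_data <- data.frame(SEQN = 1001:1010, value = rnorm(10))

# 70/30 split by sampling row indices without replacement
train_rows <- sample(nrow(toy_data), size = round(0.70 * nrow(toy_data)))
toy_train <- toy_data[train_rows, ]
toy_test  <- toy_data[-train_rows, ]

# No subject should appear in both samples
overlap <- intersect(toy_train$SEQN, toy_test$SEQN)

# The two sample sizes should add up to the original number of rows
sizes_match <- nrow(toy_train) + nrow(toy_test) == nrow(toy_data)

cat("Overlapping subjects:", length(overlap), "\n")
cat("Sizes add up:", sizes_match, "\n")
```

The same two checks could be applied to `training_sample` and `test_sample` by swapping in those data frames.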

The following code plots the outcome variable to examine its distribution and assess normality.

# Visualize the distribution of cholesterol levels using a histogram
ggplot(training_sample, aes(x = Total_Cholesterol)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +  # 10 mg/dL bins suit values ranging from 76 to 370
  labs(title = "Distribution of Cholesterol Levels", x = "Cholesterol Levels", y = "Frequency") +
  theme_minimal()

ggplot(training_sample, aes(sample = (Total_Cholesterol))) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot for Cholesterol Levels") +
  theme_minimal()

The distribution is close to normal but somewhat asymmetric, so I will apply a square root transformation to the outcome to improve normality and stabilize the variance.

# Visualize the distribution of square-root-transformed cholesterol levels using a histogram
ggplot(training_sample, aes(x = sqrt(Total_Cholesterol))) +
  geom_histogram(binwidth = 0.8, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of sqrt(Cholesterol Levels)", x = "sqrt(Cholesterol Levels)", y = "Frequency") +
  theme_minimal()

# Q-Q plot on the same square root scale as the histogram
ggplot(training_sample, aes(sample = sqrt(Total_Cholesterol))) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot for sqrt(Cholesterol Levels)") +
  theme_minimal()

I will proceed with the square root transformed data for further analysis.
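
One way to put a number on the visual comparison is to compute the sample skewness under each candidate transformation. The sketch below is only an illustration: `skewness()` is a small helper defined here, and `toy_chol` is simulated right-skewed data, not the NHANES cholesterol values.

```r
# Sample skewness: mean of the cubed standardized values
# (close to 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(123)
# Hypothetical right-skewed values standing in for Total_Cholesterol
toy_chol <- rgamma(1000, shape = 20, rate = 0.11)

# Skewness on the raw, square root, and log scales
skew_table <- c(
  raw  = skewness(toy_chol),
  sqrt = skewness(sqrt(toy_chol)),
  log  = skewness(log(toy_chol))
)
round(skew_table, 3)
```

Replacing `toy_chol` with `training_sample$Total_Cholesterol` would show which transformation brings the skewness closest to zero for the actual outcome.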

8.4 The Big Model

The following code builds the linear regression model.

# Fit the full model (including all predictors)
full_model <- lm(sqrt(Total_Cholesterol) ~ vigorous_activity_days + BMI + Age_Group + Smoking_Status, data = training_sample)

# Summarize the full model
summary(full_model)

Call:
lm(formula = sqrt(Total_Cholesterol) ~ vigorous_activity_days + 
    BMI + Age_Group + Smoking_Status, data = training_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0547 -0.8770 -0.0493  0.7757  4.2901 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              12.635348   0.312309  40.458  < 2e-16 ***
vigorous_activity_days   -0.025869   0.033979  -0.761  0.44672    
BMI                      -0.012491   0.007825  -1.596  0.11088    
Age_Group20-29            0.889587   0.197795   4.498 8.07e-06 ***
Age_Group30-39            1.444715   0.207615   6.959 8.01e-12 ***
Age_Group40-49            1.666069   0.211028   7.895 1.15e-14 ***
Age_Group50-59            1.579022   0.224690   7.028 5.06e-12 ***
Age_Group60-69            1.552336   0.248903   6.237 7.79e-10 ***
Age_Group70-79            0.908129   0.324803   2.796  0.00532 ** 
Age_Group80+              0.812359   0.516529   1.573  0.11624    
Smoking_StatusNon-Smoker -0.031346   0.114507  -0.274  0.78436    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.374 on 689 degrees of freedom
Multiple R-squared:  0.1183,    Adjusted R-squared:  0.1055 
F-statistic: 9.248 on 10 and 689 DF,  p-value: 1.858e-14
# Tidy the model to extract coefficients and other statistics
tidy_full_model <- tidy(full_model)
tidy_full_model
# A tibble: 11 × 5
   term                     estimate std.error statistic   p.value
   <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)               12.6      0.312      40.5   3.46e-184
 2 vigorous_activity_days    -0.0259   0.0340     -0.761 4.47e-  1
 3 BMI                       -0.0125   0.00782    -1.60  1.11e-  1
 4 Age_Group20-29             0.890    0.198       4.50  8.07e-  6
 5 Age_Group30-39             1.44     0.208       6.96  8.01e- 12
 6 Age_Group40-49             1.67     0.211       7.90  1.15e- 14
 7 Age_Group50-59             1.58     0.225       7.03  5.06e- 12
 8 Age_Group60-69             1.55     0.249       6.24  7.79e- 10
 9 Age_Group70-79             0.908    0.325       2.80  5.32e-  3
10 Age_Group80+               0.812    0.517       1.57  1.16e-  1
11 Smoking_StatusNon-Smoker  -0.0313   0.115      -0.274 7.84e-  1

The linear regression model results suggest several important findings related to significant predictors and model fit.

Significant Predictors:

Age Group: Age is a significant predictor of the outcome (the square root of total cholesterol). Relative to the baseline group (under 20), the 20-29, 30-39, 40-49, 50-59, 60-69, and 70-79 groups all show statistically significant positive coefficients (p < 0.05), with a smaller estimate for the 70-79 group. The 80+ group is not significant (p = 0.116). Its coefficient is similar in size to that of the 70-79 group but has a much larger standard error, which may reflect the small number of participants aged 80+ (14 in the full dataset) rather than the absence of an effect.

Vigorous Activity, BMI, and Smoking Status: None of vigorous activity days, BMI, or smoking status is a significant predictor of the outcome (p = 0.447, 0.111, and 0.784, respectively). There is therefore no evidence that these variables are associated with the outcome after adjusting for the other predictors.

Model Fit:

Residuals: The residuals range from -5.05 to 4.29, with a median near zero (-0.0493), suggesting a reasonably symmetric distribution of errors. The first and third quartiles are -0.88 and 0.78, giving an interquartile range of about 1.65 on the square root scale.

R-squared and Adjusted R-squared: The model has a multiple R-squared of 0.1183 and an adjusted R-squared of 0.1055, indicating that only about 11.8% of the variance in the outcome is explained by the predictors. While this is relatively low, the model still provides useful insight into the relationship between age and the outcome variable. The adjusted R-squared shows that the fit remains modest after accounting for the number of predictors in the model.

F-statistic and p-value: The F-statistic is 9.248, with a p-value of 1.858e-14, indicating that the model as a whole is statistically significant and provides a better fit than an intercept-only model.
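
As a quick consistency check, the adjusted R-squared can be reproduced from the multiple R-squared, the training sample size (n = 700), and the number of estimated slope coefficients (p = 10, matching the 10 and 689 degrees of freedom in the output above):

```latex
R^2_{\text{adj}}
  = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
  = 1 - (1 - 0.1183)\,\frac{699}{689}
  \approx 0.1055
```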

8.5 The Smaller Model

The following code fits a subset model using the key predictor (vigorous_activity_days) and Age_Group.

# Fit the subset model (including the key predictor and other important variables)
subset_model <- lm(sqrt(Total_Cholesterol) ~ vigorous_activity_days + Age_Group, data = training_sample)

# Summarize the subset model
summary(subset_model)

Call:
lm(formula = sqrt(Total_Cholesterol) ~ vigorous_activity_days + 
    Age_Group, data = training_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0436 -0.8984 -0.0769  0.8001  4.2519 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            12.27689    0.20725  59.237  < 2e-16 ***
vigorous_activity_days -0.02450    0.03389  -0.723  0.47001    
Age_Group20-29          0.87386    0.19563   4.467 9.26e-06 ***
Age_Group30-39          1.42659    0.20576   6.933 9.46e-12 ***
Age_Group40-49          1.63137    0.20840   7.828 1.86e-14 ***
Age_Group50-59          1.54645    0.22204   6.965 7.67e-12 ***
Age_Group60-69          1.53349    0.24347   6.299 5.35e-10 ***
Age_Group70-79          0.90726    0.32254   2.813  0.00505 ** 
Age_Group80+            0.80880    0.51468   1.571  0.11653    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.375 on 691 degrees of freedom
Multiple R-squared:  0.115, Adjusted R-squared:  0.1048 
F-statistic: 11.23 on 8 and 691 DF,  p-value: 5.255e-15
# Tidy the model to extract coefficients and other statistics
tidy_subset_model <- tidy(subset_model)
tidy_subset_model
# A tibble: 9 × 5
  term                   estimate std.error statistic   p.value
  <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)             12.3       0.207     59.2   5.36e-273
2 vigorous_activity_days  -0.0245    0.0339    -0.723 4.70e-  1
3 Age_Group20-29           0.874     0.196      4.47  9.26e-  6
4 Age_Group30-39           1.43      0.206      6.93  9.46e- 12
5 Age_Group40-49           1.63      0.208      7.83  1.86e- 14
6 Age_Group50-59           1.55      0.222      6.96  7.67e- 12
7 Age_Group60-69           1.53      0.243      6.30  5.35e- 10
8 Age_Group70-79           0.907     0.323      2.81  5.05e-  3
9 Age_Group80+             0.809     0.515      1.57  1.17e-  1

The results of the subset linear regression model, where the outcome variable is the square root of total cholesterol, reveal several key findings about the significant predictors and overall model fit.

Significant Predictors:

Age Group: As in the full model, age is a significant predictor of the outcome. All age groups except 80+ (20-29, 30-39, 40-49, 50-59, 60-69, and 70-79) have significantly higher square-root cholesterol than the baseline group (under 20), with p-values < 0.05. The 80+ group is not significant (p = 0.117), but its estimate is imprecise (standard error 0.515), so this result does not by itself show that the effect of age disappears at older ages.

Vigorous Activity: Vigorous activity days is not a significant predictor in this model (p = 0.470), so there is no evidence of a relationship between the number of vigorous activity days and cholesterol after adjusting for age group.

Model Fit:

Residuals: The residuals range from -5.04 to 4.25, with a median of -0.0769, indicating a relatively symmetric distribution of errors. The first and third quartiles are -0.90 and 0.80, giving an interquartile range of about 1.70 on the square root scale.

R-squared and Adjusted R-squared: The multiple R-squared for the model is 0.115, and the adjusted R-squared is 0.1048, meaning the model explains approximately 11.5% of the variance in the square root of total cholesterol. This is a modest fit, similar to the full model, suggesting that other unexamined factors could play a role in explaining cholesterol variation. The adjusted R-squared shows that the fit is still modest after considering the number of predictors.

F-statistic and p-value: The F-statistic is 11.23 with a p-value of 5.255e-15, which indicates that the model is statistically significant and provides a better fit than an intercept-only model.
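
Because the outcome is on the square root scale, the age coefficients are not differences in mg/dL. The sketch below back-transforms two predictions using the subset model coefficients printed above. The choice of 3 vigorous activity days is a hypothetical example participant, not a value taken from the analysis.

```r
# Coefficients copied from the subset model summary above (square root scale)
intercept  <- 12.27689
b_activity <- -0.02450
b_age_4049 <- 1.63137

# Hypothetical participant: 3 vigorous activity days per week
activity_days <- 3

# Predictions on the square root scale for the baseline (<20) and 40-49 groups
sqrt_pred_under20 <- intercept + b_activity * activity_days
sqrt_pred_4049    <- sqrt_pred_under20 + b_age_4049

# Square the predictions to return to mg/dL
pred_under20 <- sqrt_pred_under20^2
pred_4049    <- sqrt_pred_4049^2

round(c(under_20 = pred_under20,
        age_40_49 = pred_4049,
        difference = pred_4049 - pred_under20), 1)
```

Under these assumptions the back-transformed predictions are roughly 149 mg/dL for the under-20 group and 191 mg/dL for the 40-49 group, so a coefficient of 1.63 on the square root scale corresponds to a difference of about 42 mg/dL at this activity level.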

8.6 In-Sample Comparison

8.6.1 Quality of Fit

The following is a comparison of key fit statistics for both models.

# Evaluate the full model
full_model_r2 <- summary(full_model)$r.squared
full_model_adj_r2 <- summary(full_model)$adj.r.squared
full_model_rmse <- sqrt(mean(residuals(full_model)^2))

# Evaluate the subset model
subset_model_r2 <- summary(subset_model)$r.squared
subset_model_adj_r2 <- summary(subset_model)$adj.r.squared
subset_model_rmse <- sqrt(mean(residuals(subset_model)^2))
# Calculate AIC and BIC for both models
full_model_aic <- AIC(full_model)
subset_model_aic <- AIC(subset_model)

full_model_bic <- BIC(full_model)
subset_model_bic <- BIC(subset_model)
# Print comparison
comparison <- tibble(
  Model = c("Full Model", "Subset Model"),
  R_squared = c(full_model_r2, subset_model_r2),
  Adjusted_R_squared = c(full_model_adj_r2, subset_model_adj_r2),
  RMSE = c(full_model_rmse, subset_model_rmse),
  AIC = c(full_model_aic,subset_model_aic),
  BIC = c(full_model_bic,subset_model_bic)
)

comparison
# A tibble: 2 × 6
  Model        R_squared Adjusted_R_squared  RMSE   AIC   BIC
  <chr>            <dbl>              <dbl> <dbl> <dbl> <dbl>
1 Full Model       0.118              0.106  1.36 2445. 2499.
2 Subset Model     0.115              0.105  1.37 2443. 2489.
write_csv(comparison, "model_comparison.csv")
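
Since the subset model is nested within the full model, the two can also be compared with a partial F-test. The sketch below computes it from the R-squared values reported in this section. It should agree with `anova(subset_model, full_model)`, though that call was not part of the original analysis.

```r
# Values taken from the model summaries in this section
r2_full   <- 0.1183343
r2_subset <- 0.1150278
df_resid_full <- 689   # residual degrees of freedom of the full model
q <- 2                 # extra coefficients in the full model (BMI, Smoking_Status)

# Partial F statistic for dropping BMI and Smoking_Status
f_stat <- ((r2_full - r2_subset) / q) / ((1 - r2_full) / df_resid_full)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df1 = q, df2 = df_resid_full, lower.tail = FALSE)

cat("Partial F:", round(f_stat, 3), " p-value:", round(p_value, 3), "\n")
```

A p-value well above 0.05 here would be consistent with the small differences in R-squared, AIC, and BIC shown in the table.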

8.6.2 Posterior Predictive Checks

For each model, we simulate new outcome values from the fitted model, using its estimated coefficients and residual standard deviation. These simulated values can then be compared with the observed data.

simulated_full_model <- simulate(full_model, nsim = 1000)  # Simulate 1000 datasets

# Simulate from the subset model
simulated_subset_model <- simulate(subset_model, nsim = 1000)  # Simulate 1000 datasets

# Extract observed data (original values)
observed_data <- sqrt(training_sample$Total_Cholesterol)  # Response on the square root scale used by both models
sim_full_data <- simulated_full_model[[1]]  # First simulated dataset
sim_subset_data <- simulated_subset_model[[1]]  # First simulated dataset

# Create a data frame for ggplot
observed_df <- data.frame(Value = observed_data, Type = "Observed")
sim_full_df <- data.frame(Value = sim_full_data, Type = "Simulated Full Model")
sim_subset_df <- data.frame(Value = sim_subset_data, Type = "Simulated Subset Model")

# Combine the data frames
combined_df <- rbind(observed_df, sim_full_df, sim_subset_df)

# Plot the histograms using ggplot2
library(ggplot2)

# Histogram of observed values and one simulated dataset from each model
ggplot(combined_df, aes(x = Value, fill = Type)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30, color = "black") +
  scale_fill_manual(values = c("lightblue", "red", "green")) +
  labs(title = "Observed vs Simulated (Full and Subset Models)", x = "sqrt(Total Cholesterol)", y = "Frequency") +
  theme_minimal()

# Q-Q plot for Full Model
ggplot() +
  stat_qq(data = data.frame(Value = observed_data), aes(sample = Value), color = "blue") +
  stat_qq_line(data = data.frame(Value = observed_data), aes(sample = Value), color = "blue") +
  stat_qq(data = data.frame(Value = sim_full_data), aes(sample = Value), color = "red") +
  stat_qq_line(data = data.frame(Value = sim_full_data), aes(sample = Value), color = "red") +
  labs(title = "Q-Q Plot: Full Model", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

# Q-Q plot for Subset Model
ggplot() +
  stat_qq(data = data.frame(Value = observed_data), aes(sample = Value), color = "blue") +
  stat_qq_line(data = data.frame(Value = observed_data), aes(sample = Value), color = "blue") +
  stat_qq(data = data.frame(Value = sim_subset_data), aes(sample = Value), color = "red") +
  stat_qq_line(data = data.frame(Value = sim_subset_data), aes(sample = Value), color = "red") +
  labs(title = "Q-Q Plot: Subset Model", x = "Theoretical Quantiles", y = "Sample Quantiles") +
  theme_minimal()

When examining the posterior predictive checks for both the full and subset models, there appear to be systematic discrepancies between the observed and simulated data, particularly in the middle of the distribution. In both models, the simulated data does not align closely with the observed data, showing noticeable gaps in the middle range of cholesterol values. This suggests that the models might not fully capture the central tendencies of the observed data. Specifically, the simulated values tend to deviate from the observed values around the median, indicating that both models may be overestimating or underestimating the cholesterol levels in this region. This discrepancy points to potential limitations in the model structure, such as the choice of predictors or the functional form of the model. Further refinement or alternative modeling approaches may be necessary to better fit the observed data, particularly in the central range of the outcome variable.
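
The plots above use only the first of the 1000 simulated datasets. A check that uses every replicate is to compare an observed summary, such as the median, with the distribution of that summary across simulations. The following is a minimal sketch on simulated data (`toy` and `toy_fit` are hypothetical stand-ins for the training sample and fitted model):

```r
set.seed(123)
# Hypothetical data standing in for the training sample
n <- 200
toy <- data.frame(x = runif(n, 1, 7))
toy$y <- 13 + 0.1 * toy$x + rnorm(n, sd = 1.4)

toy_fit <- lm(y ~ x, data = toy)

# simulate() returns one column of simulated outcomes per replicate
sims <- simulate(toy_fit, nsim = 500)

# Compare the observed median with the distribution of simulated medians
obs_median  <- median(toy$y)
sim_medians <- vapply(sims, median, numeric(1))
interval    <- quantile(sim_medians, c(0.025, 0.975))

cat("Observed median:", round(obs_median, 2), "\n")
cat("95% range of simulated medians:", round(interval, 2), "\n")
```

An observed median far outside the simulated range would support the concern about the middle of the distribution. An observed median inside the range would suggest the gap seen in a single replicate is within simulation noise.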

8.6.3 Assessing Assumptions

# 1. Residual vs. Fitted Plot
par(mfrow = c(1, 2))  # Set up a 1x2 grid for the plots
# Residual vs Fitted for subset_model
plot(subset_model$fitted.values, residuals(subset_model), 
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Residual vs Fitted (Subset Model)", pch = 16)
abline(h = 0, col = "red")
# Residual vs Fitted for full_model
plot(full_model$fitted.values, residuals(full_model), 
     xlab = "Fitted Values", ylab = "Residuals", 
     main = "Residual vs Fitted (Full Model)", pch = 16)
abline(h = 0, col = "red")

# 2. Q-Q Plot to check Normality of Residuals
par(mfrow = c(1, 2))  # Set up a 1x2 grid for the plots
# Q-Q Plot for subset_model
qqnorm(residuals(subset_model), main = "Q-Q Plot (Subset Model)")
qqline(residuals(subset_model), col = "red")
# Q-Q Plot for full_model
qqnorm(residuals(full_model), main = "Q-Q Plot (Full Model)")
qqline(residuals(full_model), col = "red")

# 3. Histogram of Residuals
par(mfrow = c(1, 2))  # Set up a 1x2 grid for the plots
# Histogram for subset_model
hist(residuals(subset_model), main = "Histogram of Residuals (Subset Model)", 
     xlab = "Residuals", col = "lightblue", breaks = 20)
# Histogram for full_model
hist(residuals(full_model), main = "Histogram of Residuals (Full Model)", 
     xlab = "Residuals", col = "lightgreen", breaks = 20)

# 4. Cook's Distance Plot to check for Influential Points
par(mfrow = c(1, 2))  # Set up a 1x2 grid for the plots
# Cook's Distance for subset_model
plot(cooks.distance(subset_model), type = "h", main = "Cook's Distance (Subset Model)", 
     ylab = "Cook's Distance", xlab = "Index")
abline(h = 1, col = "red")
# Cook's Distance for full_model
plot(cooks.distance(full_model), type = "h", main = "Cook's Distance (Full Model)", 
     ylab = "Cook's Distance", xlab = "Index")
abline(h = 1, col = "red")

par(mfrow = c(1, 1))  # Reset the plotting layout

The residual vs. fitted plots for both models indicate that the assumption of linearity is reasonably met, as there is no clear pattern in the residuals. Some clustering around certain fitted values is visible, which likely reflects the categorical age group predictor. However, the assumption of constant variance (homoscedasticity) appears slightly violated, with some residuals showing greater spread at specific ranges of fitted values.

The Q-Q plots reveal that the residuals mostly follow the expected normal distribution, adhering well to the diagonal line for most quantiles. Nonetheless, deviations at the tails, particularly in the full model, suggest potential issues with outliers or non-normality in the extremes.

The Cook’s Distance plots for both the Full Model and Subset Model reveal that most data points have minimal influence, as indicated by low Cook’s Distance values, while a few points stand out as potentially influential. The Full Model, which includes the number of vigorous activity days, BMI, age group, and smoking status, shows slightly higher Cook’s Distance values for some observations compared to the Subset Model, which excludes BMI and smoking status. This suggests that including these two predictors might amplify the influence of certain data points, possibly due to interaction effects or multicollinearity with other predictors. Both models identify similar influential observations, indicating consistency in the data’s behavior.

While these minor violations might not significantly impact the model’s overall validity, further investigation of the outliers and potential transformations could improve the model’s assumptions and robustness.
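
The plots above draw the reference line at a Cook's Distance of 1. A stricter screening rule that is sometimes used is 4/n, where n is the number of observations. This is a rule of thumb, not a threshold used in this analysis. The sketch below applies it to simulated data with one planted unusual observation (all names are hypothetical):

```r
set.seed(123)
# Hypothetical data with one planted unusual observation
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 2 + 0.5 * toy$x + rnorm(n)
toy$y[1] <- toy$y[1] + 8   # shift the first outcome far from the trend

toy_fit <- lm(y ~ x, data = toy)
cd <- cooks.distance(toy_fit)

# Flag observations whose Cook's Distance exceeds the 4/n rule of thumb
threshold <- 4 / n
flagged <- which(cd > threshold)

cat("Threshold:", threshold, "\n")
cat("Flagged rows:", unname(flagged), "\n")
```

Flagged rows are candidates for a closer look rather than automatic removal. Refitting the model without them shows how much the coefficients depend on those points.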

8.6.4 Comparing the Models

Strengths and Weaknesses

Goodness of Fit and Predictive Performance:

The full model has a slightly higher R² (0.118 vs. 0.115) and adjusted R² (0.106 vs. 0.105), as well as a marginally lower RMSE (1.363 vs. 1.366). However, these differences are minimal, suggesting little improvement from including additional predictors. Both models exhibit systematic discrepancies in posterior predictive checks, particularly around the median cholesterol values, indicating that neither fully captures central tendencies.

Model Complexity and Penalization:

The subset model has lower AIC (2443.17 vs. 2444.55) and BIC (2488.69 vs. 2499.17), suggesting better parsimony and generalization.

Assumptions:

Both models meet linearity assumptions reasonably well but show slight heteroscedasticity in residuals. Q-Q plots reveal alignment with normality for most quantiles, but deviations at the tails, more pronounced in the full model, suggest issues with outliers or extreme values.

Preferred Model

The subset model is preferable due to its lower AIC, BIC, and comparable performance to the full model, making it a simpler and more generalizable choice. While both models share limitations in posterior predictive checks, the subset model strikes a better balance between simplicity and predictive accuracy.

The following plot summarizes key metrics.

# Create a data frame for the model performance metrics
model_metrics <- data.frame(
  Metric = c("R²", "Adjusted R²", "RMSE", "AIC", "BIC"),
  Full_Model = c(0.1183343, 0.1055380, 1.363464, 2444.554, 2499.167),
  Subset_Model = c(0.1150278, 0.1047821, 1.366019, 2443.174, 2488.685)
)

# Reshape the data for ggplot (long format)
model_metrics_long <- model_metrics %>%
  gather(key = "Model", value = "Value", -Metric)

# Log transform the metrics with large scale (AIC, BIC)
model_metrics_long$Log_Value <- ifelse(model_metrics_long$Metric %in% c("AIC", "BIC"), 
                                       log(model_metrics_long$Value), 
                                       model_metrics_long$Value)

# Create the plot
ggplot(model_metrics_long, aes(x = Metric, y = Log_Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_y_continuous(labels = scales::comma_format(), 
                     sec.axis = sec_axis(~ ., breaks = log(c(1000, 10000, 100000)), labels = c(1000, 10000, 100000))) +
  labs(title = "Comparison of Model Performance Metrics", 
       x = "Metric", 
       y = "Log-Transformed Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("Full_Model" = "steelblue", "Subset_Model" = "lightgreen"))

8.7 Model Validation

8.7.1 Calculating Prediction Errors

This code applies both models to the test_sample and back-transforms the predictions.

pred_full_model <- predict(full_model, newdata = test_sample)
pred_subset_model <- predict(subset_model, newdata = test_sample)

# Back-transform the predictions (the outcome was square root transformed, so we square the predictions)
back_transformed_pred_full <- pred_full_model^2
back_transformed_pred_subset <- pred_subset_model^2

# Add the predictions (back-transformed) to the test_sample data frame
test_sample$Pred_Full_Model <- back_transformed_pred_full
test_sample$Pred_Subset_Model <- back_transformed_pred_subset

# View the test sample with predictions
head(test_sample)
# A tibble: 6 × 8
    SEQN Total_Cholesterol vigorous_activity_days   BMI Age_Group Smoking_Status
   <dbl>             <dbl>                  <int> <dbl> <fct>     <fct>         
1 109313               224                      1  25.2 60-69     Non-Smoker    
2 109327               203                      5  23.7 50-59     Non-Smoker    
3 109371               227                      3  29.3 60-69     Non-Smoker    
4 109441               141                      5  18   20-29     Non-Smoker    
5 109454               126                      7  50.9 20-29     Non-Smoker    
6 109503               152                      3  27.4 70-79     Smoker        
# ℹ 2 more variables: Pred_Full_Model <dbl>, Pred_Subset_Model <dbl>
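
One caveat about squaring the predictions: if the model for the square root of cholesterol is written with mean \mu and residual variance \sigma^2, then the squared prediction \mu^2 falls slightly below the expected value on the original scale. A short derivation, under the usual linear model assumptions:

```latex
\sqrt{Y} = \mu + \varepsilon, \qquad
\mathbb{E}[\varepsilon] = 0, \qquad
\operatorname{Var}(\varepsilon) = \sigma^2
\quad\Longrightarrow\quad
\mathbb{E}[Y] = \mathbb{E}\!\left[(\mu + \varepsilon)^2\right] = \mu^2 + \sigma^2
```

With the residual standard error of about 1.374 reported for the full model, \sigma^2 is roughly 1.89 mg/dL, which is small relative to cholesterol values near 180 mg/dL. The simple squared prediction is therefore a reasonable approximation here.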

8.7.2 Visualizing the Predictions

The following plot visualizes the prediction and observed data.

# Create a combined data frame for the observed vs predicted values for both models
visualization_data <- test_sample %>%
  select(Total_Cholesterol, Pred_Full_Model, Pred_Subset_Model) %>%
  gather(key = "Model", value = "Predicted", -Total_Cholesterol)

# Create the plot
ggplot(visualization_data, aes(x = Predicted, y = Total_Cholesterol, color = Model)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # Line for observed = predicted
  theme_minimal() +
  labs(
    title = "Predicted vs Observed Cholesterol Levels",
    x = "Predicted Cholesterol Level",
    y = "Observed Cholesterol Level",
    color = "Model"
  ) +
  theme(legend.position = "top") +
  scale_color_manual(values = c("blue", "red"))

8.7.3 Summarizing the Errors

The following code summarizes the prediction errors of both models.

# Predictions for both models (on the square root scale)
test_sample$Predicted_Full_Model <- predict(full_model, newdata = test_sample)
test_sample$Predicted_Subset_Model <- predict(subset_model, newdata = test_sample)

# NOTE: Total_Cholesterol is already in mg/dL, so squaring it puts the observed
# values on a squared scale; only the predictions need back-transforming
test_sample$Actual_Total_Cholesterol <- test_sample$Total_Cholesterol^2

# Full Model Errors (predictions are squared to back-transform them)
full_model_residuals <- test_sample$Actual_Total_Cholesterol - test_sample$Predicted_Full_Model^2
full_model_rmspe <- sqrt(mean(full_model_residuals^2))
full_model_mape <- mean(abs(full_model_residuals / test_sample$Actual_Total_Cholesterol))  # mean absolute proportional error
full_model_mae <- max(abs(full_model_residuals))  # maximum absolute error (labelled "MAE" in the table)
full_model_r_squared <- cor(test_sample$Actual_Total_Cholesterol, test_sample$Predicted_Full_Model)^2

# Subset Model Errors (same calculations as for the full model)
subset_model_residuals <- test_sample$Actual_Total_Cholesterol - test_sample$Predicted_Subset_Model^2
subset_model_rmspe <- sqrt(mean(subset_model_residuals^2))
subset_model_mape <- mean(abs(subset_model_residuals / test_sample$Actual_Total_Cholesterol))  # mean absolute proportional error
subset_model_mae <- max(abs(subset_model_residuals))  # maximum absolute error (labelled "MAE" in the table)
subset_model_r_squared <- cor(test_sample$Actual_Total_Cholesterol, test_sample$Predicted_Subset_Model)^2

# Combine the results into a table
error_summary <- data.frame(
  Metric = c("RMSPE", "MAPE", "MAE", "Squared Correlation"),
  Full_Model = c(full_model_rmspe, full_model_mape, full_model_mae, full_model_r_squared),
  Subset_Model = c(subset_model_rmspe, subset_model_mape, subset_model_mae, subset_model_r_squared)
)

# Print the error summary table
kable(error_summary, caption = "Model Performance Error Metrics")
Model Performance Error Metrics

Metric                 Full_Model      Subset_Model
RMSPE                  3.677902e+04    3.677900e+04
MAPE                   9.935081e-01    9.935029e-01
MAE                    1.367273e+05    1.367281e+05
Squared Correlation    8.284200e-02    8.038260e-02
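
As a minimal sketch of how these four summaries behave, the block below computes them with a small helper on simulated data. The helper `prediction_errors`, the `toy` data frame, and the simulated coefficients are all assumptions made for illustration and are not part of the original analysis. The sketch also shows one possible reason a MAPE close to 1 can appear: when the observed outcome is squared even though it is already in original units, the observed values end up on a different scale from the back-transformed predictions.

```r
# Hypothetical helper (not from the original analysis): all four summaries,
# computed with observed and predicted values on the same scale.
prediction_errors <- function(observed, predicted) {
  errors <- observed - predicted
  c(
    RMSPE = sqrt(mean(errors^2)),                    # root mean squared prediction error
    MAPE = mean(abs(errors / observed)),             # mean absolute percentage error, as a proportion
    Max_Abs_Error = max(abs(errors)),                # largest single miss
    Squared_Correlation = cor(observed, predicted)^2 # squared correlation of observed and predicted
  )
}

# Simulated data, for illustration only: an outcome in mg/dL-like units and a
# model fitted to its square root.
set.seed(432)
n <- 500
bmi <- rnorm(n, mean = 29, sd = 6)
cholesterol <- (13.5 + 0.02 * bmi + rnorm(n, sd = 1.4))^2
toy <- data.frame(cholesterol, bmi)

toy_model <- lm(sqrt(cholesterol) ~ bmi, data = toy)
predicted_sqrt <- predict(toy_model, newdata = toy)

# Consistent scales: only the prediction is back-transformed.
same_scale <- prediction_errors(toy$cholesterol, predicted_sqrt^2)

# Mismatched scales: the outcome is squared although it is already in original units.
mismatched <- prediction_errors(toy$cholesterol^2, predicted_sqrt^2)

round(rbind(same_scale, mismatched), 4)
```

In the mismatched row the prediction is tiny relative to the squared outcome, so each relative error is close to 1 regardless of how good the model is. Whether this is what happened in the table above depends on how `Total_Cholesterol` is stored in `test_sample`, which this sketch does not establish.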

8.7.4 Comparing the Models

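Observations from the Metrics:

RMSPE and MAE: Both models have almost identical RMSPE and MAE values, so the overall magnitude of their prediction errors is comparable. Note that the “MAE” row reports the maximum absolute error, since the code takes the maximum rather than the mean of the absolute residuals. The values themselves (an RMSPE of about 36,779) are far larger than any plausible cholesterol measurement, which is consistent with the observed values being squared in the code above, so these rows are mainly useful for comparing the two models with each other.

MAPE: The MAPE values for both models are extremely close (0.9935), meaning that the average absolute error, expressed as a proportion of the observed value, is nearly identical for both models.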

Squared Correlation (R²): Both models exhibit low squared correlation, with the full model performing slightly better (0.0828 vs. 0.0804). However, neither model has a high R², indicating that the models do not explain much of the variance in the observed cholesterol levels.

Visual Comparison: The scatter plot above illustrates the relationship between the observed and predicted cholesterol levels for both models. Here are some insights from the plot:

General Trend: Both models seem to predict similar ranges of cholesterol values, but neither model perfectly matches the observed values, especially for the higher cholesterol levels. The line showing “observed = predicted” (the dashed line) is not closely matched by the points, indicating a relatively weak model fit.

Clustering and Spread: Both the full model (blue) and the subset model (red) show some clustering of points near the lower predicted values (around 150-160), with a wider spread as the predicted values increase. This suggests that both models are somewhat conservative in predicting higher cholesterol levels.

Conclusion: While both models have nearly identical performance metrics and display similar trends in the scatter plot, the full model may have a slight edge due to its marginally better squared correlation. However, the low R² for both models suggests that there may be a need for further refinement or alternative modeling techniques. Based on these results, I would prefer the full model, but with the understanding that neither model is particularly strong in explaining the variance in the outcome. Both models could benefit from further exploration or adjustments.

8.8 Discussion

8.8.1 Chosen Model

Based on the results, I would choose the Full Model for further analysis, and here’s why:

Slightly Better Performance Metrics: The Full Model has a marginally better squared correlation (0.0828 vs. 0.0804) than the Subset Model. The difference is small, but it is the only metric on which the two models are clearly distinguishable.

More Predictive Power: The two models are essentially tied on RMSPE, MAPE, and MAE (they differ only from the sixth significant digit onward), so the Full Model’s advantage rests on its slightly stronger correlation between predicted and observed values. This improvement is modest and should not be overstated.

Model Complexity: The Full Model includes more predictors than the Subset Model, which adds complexity but may also capture more nuances in the data. If the Full Model includes key predictors that the Subset Model lacks, this might help explain certain aspects of the outcome better than the simpler model.

8.8.2 Answering My Question

The research question asks: How does physical activity (measured as the number of days of vigorous recreational activities per week) predict cholesterol levels, adjusting for key factors such as BMI, age, smoking status, and hypertension?

Based on the results from the full model, which included all the key predictors (vigorous activity, BMI, age group, smoking status), we can make the following conclusions:

The model explains approximately 11.36% of the variance in cholesterol levels in the training sample. This means that, while the full model offers some predictive power, a substantial portion of cholesterol levels’ variability remains unexplained by these factors alone.

Vigorous activity (number of days per week) did not appear to be a strong predictor of cholesterol levels in the full model: even together with the other predictors, it accounts for only a small share of the variance. The model’s performance was also modest when validated on the test sample (squared correlation of about 8.3%, compared with 11.36% in the training sample), suggesting that it may not generalize well.

The other key predictors, such as BMI, age, smoking status, and hypertension, did not add much explanatory power for cholesterol levels in this model.

This suggests that physical activity, as captured by the number of days of vigorous recreational activities, may not be the primary driver of cholesterol levels when adjusted for the other variables in this dataset. Other factors, possibly unmeasured or more complex interactions, might be playing a significant role.

Limitations of the Study:

Model Fit and Predictive Power:

The low R-squared values (both in-sample and out-of-sample) indicate that the model does not explain a substantial portion of the variability in cholesterol levels. This suggests that the model’s predictive power is weak.

Data Quality and Missing Information:

Missing data could have impacted model performance, especially for variables like smoking status, hypertension, or vigorous activity. Though imputation was done, it is possible that imputed data may not fully capture the true relationships in the data.

Potential Confounding Factors:

Other important confounders or mediators (such as diet, genetics, medication, or socioeconomic status) may not have been included in the model, leading to residual confounding. This could limit the interpretation of the findings.

Measurement of Physical Activity:

The measure of vigorous activity used here (self-reported days of activity per week) may not fully capture the intensity, duration, or nature of the physical activity. More granular or objective measures (e.g., wearable fitness trackers) could provide better insights.

Homogeneity of the Sample:

The study sample may not fully represent diverse populations in terms of age, health conditions, or lifestyle, which may limit the generalizability of the findings to broader populations.

Model Assumptions:

The linear regression model assumes a linear relationship between predictors and the outcome. If these relationships are non-linear or involve complex interactions, a simple linear model may not capture them adequately.

8.8.3 Next Steps

Model Refinement:

Explore more complex machine learning models, such as decision trees, random forests, or gradient boosting machines, which may capture non-linear relationships and interactions between predictors more effectively.

Feature Engineering:

Consider introducing interaction terms between variables (e.g., between BMI and vigorous activity), or even non-linear transformations of key predictors, such as using BMI as a quadratic term or applying logarithmic transformations to physical activity.

Including Additional Predictors:

Collect and incorporate additional variables that may better explain cholesterol levels, such as dietary intake, medication usage, or genetic information (e.g., family history of cholesterol-related conditions).

Investigating the Relationship Between Physical Activity and Cholesterol:

Consider investigating more granular measurements of physical activity, such as average minutes per day or intensity levels. This could help assess whether the number of vigorous days per week is the most appropriate metric for modeling cholesterol levels.

Longitudinal or Experimental Studies:

A longitudinal study or a randomized controlled trial (RCT) would be ideal to better understand the causal relationship between physical activity and cholesterol levels over time. These designs could better isolate the effects of physical activity from confounding variables and provide more robust evidence.

Addressing Missing Data More Effectively:

Rather than using single imputation, explore more advanced imputation techniques or even data augmentation methods to better handle missing data and improve model accuracy.
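
The feature engineering step listed above could be explored along the following lines. This is a minimal sketch on simulated data: the variable names (`bmi`, `vigorous_days`, `cholesterol`) and the simulated effect sizes are assumptions for illustration, not the NHANES variables or estimates used in this report.

```r
# Simulated data for illustration only; names and effect sizes are hypothetical.
set.seed(2024)
n <- 400
toy <- data.frame(
  bmi = rnorm(n, mean = 29, sd = 6),
  vigorous_days = sample(0:7, n, replace = TRUE)
)
toy$cholesterol <- 190 + 0.8 * toy$bmi - 0.03 * (toy$bmi - 29)^2 -
  1.5 * toy$vigorous_days + rnorm(n, sd = 35)

# Baseline: main effects only
main_effects <- lm(cholesterol ~ bmi + vigorous_days, data = toy)

# Extended: adds a quadratic BMI term and a BMI-by-activity interaction
extended <- lm(cholesterol ~ bmi * vigorous_days + I(bmi^2), data = toy)

# F test for whether the added terms improve on the baseline (nested models)
anova(main_effects, extended)

# Adjusted R-squared penalizes the extra terms, so it is a fairer comparison
c(
  main_effects = summary(main_effects)$adj.r.squared,
  extended = summary(extended)$adj.r.squared
)
```

Because activity days can be zero, a logarithmic transformation of that variable would need an offset, for example `log1p(vigorous_days)`. Any added terms should still be judged on held-out data, as was done for the two models in this report.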

8.8.4 Reflection

Had I known at the start of Study 2 what I have learned through the process, I would have approached the analysis with a broader perspective. While the focus was on using linear regression, I would have initially considered a wider range of potential predictors that could better explain the variability in cholesterol levels. Incorporating more comprehensive data, such as dietary habits, medication usage, or family medical history, might have added crucial information to the model and improved its explanatory power. Additionally, I would have conducted more thorough research into the existing literature to identify other factors that may have influenced cholesterol levels, ensuring that the most relevant predictors were included from the beginning.

Furthermore, I would have sought a larger and more diverse sample to improve the model’s generalizability. This could involve reaching out to different populations or using more advanced data collection techniques to ensure a representative sample. In hindsight, I would also have explored more advanced methods for handling missing data. These changes could have potentially led to a more accurate and insightful analysis of the relationship between physical activity and cholesterol levels.

8.8.5 Reasoning for the Dashboard

In this study, I decided to split the dashboard into two primary sections: Data Overview and Model. This structure allows for a clear and concise presentation of both the dataset’s key features and the predictive modeling results, offering users a comprehensive understanding of the data and insights drawn from the analysis.

Data Overview

The first section, Data Overview, provides an at-a-glance summary of the study participants and key variables, such as BMI, cholesterol levels, and vigorous activity days. The dashboard presents these summaries using value boxes and visualizations, including histograms and bar charts, to facilitate quick understanding of the dataset’s distribution. By highlighting important statistics (e.g., average BMI, cholesterol levels, and activity days), this section gives users a snapshot of the data, allowing them to grasp key trends before delving into the predictive modeling.

The categorical summaries (age groups and smoking status) are visualized with bar charts, offering insights into the demographic distribution of the study population. The use of color-coded value boxes for the number of participants, healthy cholesterol levels, and average vigorous activity days helps draw attention to important figures that define the sample.

Model

The second section, Model, presents the results from two linear regression models. This section is tailored for a more technical audience, providing detailed insights into the modeling process, the significance of the variables included, and their respective coefficients.

This section displays residual plots for both models, helping users assess model fit and interpret the results. Additionally, a model comparison table presents key performance metrics for each model, allowing users to make informed decisions about which model best predicts cholesterol levels.

The following code saves the data that I will use for this section in the dashboard.

# For the subset model
subset_model_data <- data.frame(
  Fitted_Values = subset_model$fitted.values,
  Residuals = residuals(subset_model)
)

# For the full model
full_model_data <- data.frame(
  Fitted_Values = full_model$fitted.values,
  Residuals = residuals(full_model)
)

# Saving both dataframes as CSV files
write.csv(subset_model_data, "subset_model_data.csv", row.names = FALSE)
write.csv(full_model_data, "full_model_data.csv", row.names = FALSE)
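
A minimal sketch of how a saved file of this shape could be read back and turned into a residual plot is given below. It uses a stand-in model fitted to the built-in `mtcars` data and a temporary file, both of which are assumptions for illustration; the report’s models and CSV files are not touched. Base graphics are used only to keep the sketch free of package dependencies, and the dashboard itself may draw the plot differently.

```r
# Stand-in model on a built-in dataset (hypothetical; not the report's models)
demo_model <- lm(mpg ~ wt + hp, data = mtcars)
demo_data <- data.frame(
  Fitted_Values = fitted(demo_model),
  Residuals = residuals(demo_model)
)

# Save to a temporary CSV and read it back, as a dashboard script might
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_data, csv_path, row.names = FALSE)
reloaded <- read.csv(csv_path)

# Residuals against fitted values, with a dashed reference line at zero
plot(
  reloaded$Fitted_Values, reloaded$Residuals,
  xlab = "Fitted Values", ylab = "Residuals",
  main = "Residuals vs Fitted (stand-in model)"
)
abline(h = 0, lty = 2)
```

Points scattered evenly around the zero line with no visible curve or funnel shape would support the linearity and constant-variance assumptions of the model.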

9 Session Information

xfun::session_info()
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Locale: en_US.UTF-8 / en_US.UTF-8 / en_US.UTF-8 / C / en_US.UTF-8 / en_US.UTF-8

Package version:
  askpass_1.2.1        backports_1.5.0      base_4.4.2          
  base64enc_0.1.3      bit_4.5.0.1          bit64_4.5.2         
  boot_1.3-31          broom_1.0.7          broom.mixed_0.2.9.6 
  bslib_0.8.0          cachem_1.1.0         caret_6.0-94        
  cellranger_1.1.0     class_7.3-22         cli_3.6.3           
  clipr_0.8.0          clock_0.7.1          coda_0.19.4.1       
  codetools_0.2-20     colorspace_2.1-1     compiler_4.4.2      
  cpp11_0.5.1          crayon_1.5.3         crosstalk_1.2.1     
  curl_6.0.1           data.table_1.16.2    DescTools_0.99.58   
  diagram_1.6.5        digest_0.6.37        dplyr_1.1.4         
  DT_0.33              e1071_1.7-16         evaluate_1.0.1      
  Exact_3.3            expm_1.0-0           fansi_1.0.6         
  farver_2.1.2         fastmap_1.2.0        fontawesome_0.5.3   
  forcats_1.0.0        foreach_1.5.2        fs_1.6.5            
  furrr_0.3.1          future_1.34.0        future.apply_1.11.3 
  generics_0.1.3       ggplot2_3.5.1        gld_2.6.6           
  glmnet_4.1-8         globals_0.16.3       glue_1.8.0          
  gower_1.0.1          graphics_4.4.2       grDevices_4.4.2     
  grid_4.4.2           gridExtra_2.3        gtable_0.3.6        
  hardhat_1.4.0        haven_2.5.4          highr_0.11          
  hms_1.1.3            htmltools_0.5.8.1    htmlwidgets_1.6.4   
  httpuv_1.6.15        httr_1.4.7           ipred_0.9-15        
  isoband_0.2.7        iterators_1.0.14     jomo_2.7-6          
  jquerylib_0.1.4      jsonlite_1.8.9       jtools_2.3.0        
  KernSmooth_2.23.24   knitr_1.49           labeling_0.4.3      
  later_1.4.1          lattice_0.22-6       lava_1.8.0          
  lazyeval_0.2.2       lifecycle_1.0.4      listenv_0.9.1       
  lme4_1.1-35.5        lmom_3.2             lubridate_1.9.3     
  magrittr_2.0.3       MASS_7.3-61          Matrix_1.7-1        
  memoise_2.0.1        methods_4.4.2        mgcv_1.9.1          
  mice_3.17.0          mime_0.12            minqa_1.2.8         
  mitml_0.4-5          ModelMetrics_1.2.2.2 munsell_0.5.1       
  mvtnorm_1.3-2        nlme_3.1-166         nloptr_2.1.1        
  nnet_7.3-19          numDeriv_2016.8.1.1  openssl_2.2.2       
  ordinal_2023.12.4.1  pan_1.9              pander_0.6.5        
  parallel_4.4.2       parallelly_1.40.1    pillar_1.9.0        
  pkgconfig_2.0.3      plotly_4.10.4        plyr_1.8.9          
  prettyunits_1.2.0    pROC_1.18.5          prodlim_2024.06.25  
  progress_1.2.3       progressr_0.15.1     promises_1.3.2      
  proxy_0.4-27         purrr_1.0.2          R6_2.5.1            
  rappdirs_0.3.3       RColorBrewer_1.1.3   Rcpp_1.0.13-1       
  RcppEigen_0.3.4.0.2  readr_2.1.5          readxl_1.4.3        
  recipes_1.1.0        rematch_2.0.0        reshape2_1.4.4      
  rlang_1.1.4          rmarkdown_2.29       rootSolve_1.8.2.4   
  rpart_4.1.23         rstudioapi_0.17.1    sandwich_3.1.1      
  sass_0.4.9           scales_1.3.0         shape_1.4.6.1       
  splines_4.4.2        SQUAREM_2021.1       stats_4.4.2         
  stats4_4.4.2         stringi_1.8.4        stringr_1.5.1       
  survival_3.7-0       sys_3.4.3            tibble_3.2.1        
  tidyr_1.3.1          tidyselect_1.2.1     timechange_0.3.0    
  timeDate_4041.110    tinytex_0.54         tools_4.4.2         
  tzdb_0.4.0           ucminf_1.2.2         utf8_1.2.4          
  utils_4.4.2          vctrs_0.6.5          viridisLite_0.4.2   
  vroom_1.6.5          withr_3.0.2          xfun_0.49           
  yaml_2.3.10          zoo_1.8.12